1. INTRODUCTION

The theory of conjunctive queries over relational structures is, from a certain point 
of view, the greatest success story of the theory of database queries. These queries 
correspond to the most common queries in database practice, e.g. SQL select-from- 
where queries with conditions combined using "and" only. Their evaluation problem 
has also been considered in different contexts and under different names, notably as
the Constraint Satisfaction problem in AI [Kolaitis and Vardi 1998; Dechter 2003]
and the H-coloring problem in graph theory [Hell and Nesetril 2004]. Conjunctive
queries are surprisingly well-behaved: many important properties hold for
conjunctive queries but fail for more general query languages (cf. [Chandra and
Merlin 1977; Abiteboul et al. 1995; Maier 1983]).

Unranked labeled trees are a clean abstraction of HTML, XML, LDAP, and
linguistic parse trees. This motivates the study of conjunctive queries over trees,
where the tree structures are represented using unary node label relations and
binary relations (often referred to as axes) such as Child, Descendant, and Following.

XML Queries. Conjunctive queries over trees are naturally related to the problem
of evaluating queries (e.g., XQuery or XSLT) on XML data (cf. [Deutsch and
Tannen 2003a]). However, conjunctive queries are a cleaner and simpler model
whose complexity and expressiveness can be formally studied (while XQuery and
XSLT are Turing-complete).

(Acyclic) conjunctive queries over trees are a generalization of the most frequently
used fragment of XPath. For example, the XPath query //A[B]/following::C is
equivalent to the (acyclic) conjunctive query

$$Q(z) \leftarrow A(x),\ \mathit{Child}(x,y),\ B(y),\ \mathit{Following}(x,z),\ C(z).$$
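To make the correspondence concrete, the following minimal sketch evaluates the conjunctive query above by brute force on a small tree. The tree, its node numbers, and its labels are hypothetical and chosen only for illustration; the enumeration of all variable assignments is a naive strategy, not one of the algorithms studied in this article.

```python
from itertools import product

# Hypothetical example tree: node -> ordered list of children, plus one label per node.
children = {0: [1, 4], 1: [2, 3], 2: [], 3: [], 4: []}
label = {0: "R", 1: "A", 2: "B", 3: "C", 4: "C"}

def descendants(v):
    """All proper descendants of v."""
    out = []
    for c in children[v]:
        out.append(c)
        out.extend(descendants(c))
    return out

def preorder(v=0):
    """Document order (depth-first, left-to-right)."""
    return [v] + [u for c in children[v] for u in preorder(c)]

pre = {v: i for i, v in enumerate(preorder())}

def child(x, y):
    return y in children[x]

def following(x, y):
    # y comes after x in document order and is not a descendant of x.
    return pre[x] < pre[y] and y not in descendants(x)

def answers():
    """Q(z) <- A(x), Child(x,y), B(y), Following(x,z), C(z), by enumerating all assignments."""
    nodes = list(children)
    result = set()
    for x, y, z in product(nodes, repeat=3):
        if (label[x] == "A" and child(x, y) and label[y] == "B"
                and following(x, z) and label[z] == "C"):
            result.add(z)
    return sorted(result)

print(answers())  # [4]: node 3 is labeled C but is a descendant of the A-node, so it does not follow it
```

The same answer set is what the XPath expression //A[B]/following::C would select on this tree, which is the sense in which the two formulations are equivalent.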

While XPath has been studied extensively (see e.g. [Gottlob et al. 2005; Gottlob
et al. 2005] on its complexity, [Benedikt et al. 2003; Olteanu et al. 2002] on its
expressive power, and [Hidders 2003] on the satisfiability problem), little work so
far has addressed the theoretical properties of cyclic conjunctive queries over trees.
Sporadic results on their complexity can be found in [Meuss et al. 2001; Gottlob
and Koch 2002; 2004; Meuss and Schulz 2001].

Data extraction and integration. (Cyclic) conjunctive queries on trees have 
been used previously in data integration, where queries in languages such as XQuery 
were canonically mapped to conjunctive queries over trees to build upon the existing 
work on data integration with conjunctive queries [Deutsch and Tannen 2003a; 
2003b]. Another application is Web information extraction using a datalog-like 
language over trees [Baumgartner et al. 2001; Gottlob and Koch 2004]. (Of course, 
each nonrecursive datalog rule is a conjunctive query.) 

Queries in computational linguistics. A further area in which such queries are
employed is computational linguistics, where one needs to search in, or check
properties of, large corpora of parsed natural language. Corpora such as Penn Treebank
[LDC 1999] are unranked trees labeled with the phrase structure of parsed (for
Treebank, financial news) text. A query asking for prepositional phrases following
noun phrases in the same sentence can be phrased as the conjunctive query

$$Q(z) \leftarrow S(x),\ \mathit{Descendant}(x,y),\ NP(y),\ \mathit{Descendant}(x,z),\ PP(z),\ \mathit{Following}(y,z).$$

Figure 1 shows this query in the intuitive graphical notation that we will use 
throughout the article (in which nodes correspond to variables, node labels to unary 
atoms, and edges to binary atoms). 

Fig. 1. A query graph.

Dominance constraints. Another important topic in computational linguistics
is conjunctions of dominance constraints [Marcus et al. 1983], which turn out to
be equivalent to (Boolean) conjunctive queries over trees. Dominance constraints
have been influential as a means of incompletely specifying parse trees of natural
language, in cases where (intermediate) results of parsing and disambiguation
remain ambiguous. One problem of practical importance is the rewriting of sets
of dominance constraints into equivalent but simpler sets (in particular, so-called
solved forms [Bodirsky et al. 2004], which correspond to acyclic queries). This
implies that studying the expressive power of conjunctive queries over trees, and the
problem of deciding whether there is a set of acyclic conjunctive queries equivalent
to a given conjunctive query, is relevant to computational linguistics.

Higher-order unification. The query evaluation problem for conjunctive queries
over trees is also closely related to the context matching problem¹, a variant of the
well-known context-unification problem [Schmidt-Schauß and Schulz 1998; 2002].
A tractability frontier for the context matching problem is outlined in
[Schmidt-Schauß and Stuber 2001]. However, little insight is gained from this for the
database context, since the classes studied in [Schmidt-Schauß and Stuber 2001] become
unnatural when formulated as conjunctive queries².

Contributions 

Given the substantial number of applications that we have hinted at above and 
the nice connection between database theory, computational linguistics, and term 
rewriting, it is surprising that conjunctive queries over trees have never been the
object of a concerted study³.

In particular, three questions seem worth studying: 

(1) The complexity of (cyclic) conjunctive queries on trees has only been touched
on in the literature. There is little understanding of how the complexity of
conjunctive queries over trees depends on the relations used to model the tree.

(2) There is a natural connection between conjunctive queries and XPath. Since
all XPath queries are acyclic, the question arises whether the acyclic positive
queries (i.e., unions of acyclic conjunctive queries) are as expressive as the full
class of conjunctive queries over trees.⁴



¹ To be precise, the analogy is most direct with ranked trees.

² These conjunctive queries require node inequality $\neq$ as a binary relation in addition
to the tree structure relations. If $\neq$ is removed, the queries become acyclic. However, it
is easy to see that already conjunctive queries using only the inequality relation over a
fixed tree of three nodes are NP-complete, by a reduction from Graph 3-Colorability.

³ Of course, as mentioned above, there are a number of papers that implicitly contain relevant
results [Meuss et al. 2001; Meuss and Schulz 2001; Hidders 2003; Schmidt-Schauß and Stuber 2001].
The papers [Hell et al. 1996a; 1996b] address the complexity of a notion of tree homomorphisms
that is incomparable to the one used in database theory, and the results there are orthogonal.

⁴ This is equivalent to asking whether for all conjunctive queries over trees there exist
equivalent positive Core XPath queries [Gottlob et al. 2005].
(3) If that is the case, how much bigger do the acyclic versions of queries get
than their cyclic counterparts? Apart from being of theoretical interest, first
translating queries into their acyclic versions, if that is possible, and then
evaluating them as such may be a practical query evaluation strategy, because there
are particularly good algorithms for evaluating such queries [Yannakakis 1981;
Chekuri and Rajaraman 1997; Flum et al. 2002; Gottlob and Koch 2004].

We thus study conjunctive queries on tree structures represented using the XPath
axis relations child, descendant, descendant-or-self, following-sibling, and
following. Since we are free to use these relations with any pair of variables of our
conjunctive queries (differently from XPath), these five axes render all others, i.e.
parent, ancestor, ancestor-or-self, preceding-sibling, and preceding, redundant. Typed
child axes such as attribute are redundant with the child axis and unary relations
in our framework.

For a more elegant framework, we study the axes Child, $\mathit{Child}^*$ (= descendant-or-self),
$\mathit{Child}^+$ (= descendant), NextSibling, $\mathit{NextSibling}^*$, $\mathit{NextSibling}^+$
(= following-sibling), and Following. (NextSibling and $\mathit{NextSibling}^*$ are not supported
in XPath but are nevertheless considered here.) Subsequently, we denote this set of all axes
considered in this article by Ax.

The main contributions of this article are as follows. 

— In [Gutjahr et al. 1992] it was shown that the H-coloring problem (cf. [Hell
and Nesetril 2004]), and thus Boolean conjunctive query evaluation, on directed
graphs H that have the so-called X-property (pronounced "X-underbar-property")
is polynomial-time solvable. We determine which of our axis relations have the
X-property with respect to which orders of the domain elements. We show that
the subset-maximal sets of axis relations for which the X-property yields tractable
query evaluation are the three disjoint sets

{Child, NextSibling, $\mathit{NextSibling}^*$, $\mathit{NextSibling}^+$},

{$\mathit{Child}^*$, $\mathit{Child}^+$}, and {Following}.

— We prove that the conjunctive query evaluation problem for queries involving
any two axes that do not have the X-property with respect to the same ordering
of the tree nodes is NP-complete.

Thus the X-property yields a complete characterization of the tractability frontier
of the problem (under the assumption that $P \neq NP$).

Theorem 1.1. Unless P = NP, for any $F \subseteq Ax$, the conjunctive queries over
structures with unary relations and binary relations from F are in P if and only
if there is a total order < such that all binary relations in F have the X-property
w.r.t. <.

Moreover, we have the dichotomy that for any of our tree structures, the
conjunctive query evaluation problem is either in P or NP-complete.

Table I shows the complexities of conjunctive queries over structures containing
unary relations and either one or two axes.⁵

⁵ It was shown in [Gottlob and Koch 2004] that conjunctive queries over Child and NextSibling
are in P. Corollary 4.2 is from [Gottlob and Koch 2002]. The other results are new.

Table I. Complexity results for signatures with one or two axes, with pointers to relevant theorems.

All NP-hardness results hold already for fixed data trees (query complexity [Vardi
1982]). The polynomial-time upper bounds are established under the assumption
that both data and query are variable (combined complexity).
— We study the expressive power of conjunctive queries on trees. We show that for 
each conjunctive query over trees, there is an equivalent acyclic positive query 
(APQ) over the same tree relations. The blowup in size of the APQs produced 
is exponential in the worst case. 

It follows that there is an equivalent XPath query for each conjunctive query over 
trees, since each APQ can be translated into XPath (even in linear time). 
— Finally, we provide a result that sheds some light on the succinctness of (cyclic)
conjunctive queries and which demonstrates that the blow-up observed in our
translation is actually necessary. We prove that there are conjunctive queries
over trees for which no equivalent polynomially-sized APQ exists.

The structure of the article is as follows. We start with basic notions in Section 2.
Section 3 introduces the X-property and the associated framework for finding classes
of conjunctive queries that can be evaluated in polynomial time. Section 4 contains
our polynomial-time complexity results. Section 5 completes our tractability
frontier with the NP-hardness results. In Section 6, we provide our expressiveness
results. Finally, we present our succinctness result in Section 7.

2. PRELIMINARIES 

Let $\Sigma$ be a labeling alphabet. Throughout the article, if not explicitly stated
otherwise, we will not assume $\Sigma$ to be fixed. An unranked tree is a tree in which each
node may have an unbounded number of children. We allow for tree nodes to be labeled
with multiple labels. However, throughout the article, our tractability results will
support multiple labels while our NP-hardness and expressiveness results will not
make use of them.

We represent trees as relational structures using unary label relations
$(\mathit{Label}_a)_{a \in \Sigma}$ and binary relations called axes. For a relational structure
$\mathcal{A}$, let $A = |\mathcal{A}|$ denote the finite domain (in the case of a tree, the nodes)
and let $\|\mathcal{A}\|$ denote the size of the structure under any reasonable encoding scheme
(see e.g. [Ebbinghaus and Flum 1999]). We use the binary axis relations Child (defined in the
normal way) and NextSibling (where NextSibling(v, w) if and only if w is the right neighboring
sibling of v in the tree), their transitive and reflexive and transitive closures (denoted
$\mathit{Child}^+$, $\mathit{NextSibling}^+$, $\mathit{Child}^*$, $\mathit{NextSibling}^*$), and the
axis Following, defined as

$$\mathit{Following}(x,y) = \exists z_1 \exists z_2\; \mathit{Child}^*(z_1,x) \wedge \mathit{NextSibling}^+(z_1,z_2) \wedge \mathit{Child}^*(z_2,y). \qquad (1)$$

This set of axes covers the standard XPath axes (cf. [World Wide Web Consortium
1999]) by the equivalences $\mathit{Child}^+$ = Descendant, $\mathit{Child}^*$ = Descendant-or-self, and
$\mathit{NextSibling}^+$ = Following-sibling.
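The axis definitions can be made concrete with a short sketch that materializes each axis as a set of node pairs and derives Following literally from formula (1). The example tree and the helper names `plus` and `star` are assumptions introduced here for illustration; the closures are computed by a naive fixpoint, which is adequate only for small trees.

```python
# Hypothetical example tree: node -> ordered list of children.
children = {"r": ["a", "d"], "a": ["b", "c"], "b": [], "c": [], "d": ["e"], "e": []}
nodes = list(children)

child = {(p, c) for p, cs in children.items() for c in cs}
next_sibling = {(cs[i], cs[i + 1]) for cs in children.values() for i in range(len(cs) - 1)}

def plus(rel):
    """Transitive closure of a binary relation (naive fixpoint iteration)."""
    closure = set(rel)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c} - closure
        if not new:
            return closure
        closure |= new

def star(rel):
    """Reflexive and transitive closure over the node set."""
    return plus(rel) | {(v, v) for v in nodes}

child_plus, child_star = plus(child), star(child)
ns_plus, ns_star = plus(next_sibling), star(next_sibling)

# Formula (1): Following(x,y) = exists z1,z2: Child*(z1,x) and NextSibling+(z1,z2) and Child*(z2,y)
following = {(x, y)
             for (z1, x) in child_star
             for (z1b, z2) in ns_plus if z1b == z1
             for (z2b, y) in child_star if z2b == z2}

print(sorted(following))
```

On this tree, Following relates every node in the subtree of `a` to every node in the subtree of `d`, and additionally `b` to `c`; no node is related to one of its own descendants or ancestors.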

We consider three well-known total orderings on finite ordered trees. The
pre-order $<_{pre}$ corresponds to a depth-first left-to-right traversal of a tree. If XML
documents are represented as trees in the usual way, the pre-order coincides with the
document order. It is given by the sequence of opening tags of the XML elements
(corresponding to nodes). The post-order $<_{post}$ corresponds to a bottom-up
left-to-right traversal of the tree and is given by the sequence of closing tags of elements.
Furthermore, we also consider the ordering $<_{bflr}$, which is given by the sequence of
opening tags if we traverse the tree breadth-first left-to-right.
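As an illustration of the three orderings, the sketch below lists the nodes of a small tree in pre-order, post-order, and breadth-first left-to-right order. The tree and the function names are hypothetical and serve only as an example.

```python
from collections import deque

# Hypothetical example tree: node -> ordered list of children.
children = {"r": ["a", "d"], "a": ["b", "c"], "b": [], "c": [], "d": ["e"], "e": []}

def pre_order(v):
    """Depth-first, left-to-right: a node precedes its subtree (order of opening tags)."""
    return [v] + [u for c in children[v] for u in pre_order(c)]

def post_order(v):
    """A node follows its whole subtree (order of closing tags)."""
    return [u for c in children[v] for u in post_order(c)] + [v]

def bflr_order(root):
    """Breadth-first, left-to-right traversal."""
    order, queue = [], deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        queue.extend(children[v])
    return order

print(pre_order("r"))   # ['r', 'a', 'b', 'c', 'd', 'e']
print(post_order("r"))  # ['b', 'c', 'a', 'e', 'd', 'r']
print(bflr_order("r"))  # ['r', 'a', 'd', 'b', 'c', 'e']
```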

The k-ary conjunctive queries can be defined by positive existential first-order
formulas without disjunction and with k free variables. We will usually use the
standard (datalog) rule notation for conjunctive queries (cf. [Abiteboul et al. 1995]).

We call the 0-ary queries Boolean and the unary queries monadic. The
containment of queries Q and Q' is defined in the normal way: Query Q is said to be
contained in Q' (denoted $Q \subseteq Q'$) iff, for all tree structures $\mathcal{A}$, Q' returns at
least all tuples on $\mathcal{A}$ that Q returns on $\mathcal{A}$. (To cover Boolean queries, tuples here
may be nullary.) Two queries Q, Q' are called equivalent iff $Q \subseteq Q'$ and $Q' \subseteq Q$.

Let Q be a conjunctive query and let Var(Q) denote the variables appearing in
Q. The query graph of Q over unary and binary relations is the directed multigraph
G = (V, E) with edge labels and multiple node labels such that V = Var(Q), node
x is labeled P iff Q contains unary atom P(x), and E contains the labeled directed edge
$x \xrightarrow{R} y$ if and only if Q contains binary atom R(x, y). Figure 1 shows an example of
such a query graph. Our notion of query graph is sometimes called positive atomic
diagram in model theory or the graph of the canonical database of a query in the
database theory literature.

Throughout the article, we use lower case node and variable names and upper 
case label and relation names. 

3. THE X-PROPERTY

Let Q be a conjunctive query and let A denote the finite domain, i.e. in case of a
tree the set of nodes. A pre-valuation for Q is a total function $\Theta : \mathit{Var}(Q) \to 2^A$
that assigns to each variable of Q a nonempty subset of A. A valuation for Q is a
total function $\theta : \mathit{Var}(Q) \to A$.

Let $\mathcal{A}$ be a relational structure of unary and binary relations. A pre-valuation $\Theta$
is called arc-consistent⁶ iff for each unary atom P(x) in Q and each $v \in \Theta(x)$, P(v)
is true (in $\mathcal{A}$) and for each binary atom R(x, y) in Q, for each $v \in \Theta(x)$ there exists
$w \in \Theta(y)$ such that R(v, w) is true and for each $w \in \Theta(y)$ there exists $v \in \Theta(x)$
such that R(v, w) is true.



⁶ This notion is well-known in constraint satisfaction, cf. [Dechter 2003].




Proposition 3.1 (Folklore). There is an algorithm which checks in time
$O(\|\mathcal{A}\| \cdot |Q|)$ whether an arc-consistent pre-valuation of Q on $\mathcal{A}$ exists, and if it
does, returns one.

Proof. We phrase the problem of computing $\Theta$ by deciding, for each x, v,
whether $v \notin \Theta(x)$ as an instance $\mathcal{P}$ of propositional Horn-SAT. The propositional
predicates are the atoms Remove(x, v) (where $x \in \mathit{Var}(Q)$, $v \in A$ are constants),
and the Horn clauses are

$$
\begin{aligned}
&\{\, \mathit{Remove}(x,v) \leftarrow . \mid P(x) \in Q,\ v \in A,\ \neg P^{\mathcal{A}}(v) \,\} \ \cup \\
&\{\, \mathit{Remove}(x,v) \leftarrow \bigwedge \{ \mathit{Remove}(y,w) \mid R^{\mathcal{A}}(v,w) \}. \mid R(x,y) \in Q,\ v \in A \,\} \ \cup \\
&\{\, \mathit{Remove}(y,w) \leftarrow \bigwedge \{ \mathit{Remove}(x,v) \mid R^{\mathcal{A}}(v,w) \}. \mid R(x,y) \in Q,\ w \in A \,\}
\end{aligned}
$$

Let Remove be the binary relation defined by $\mathcal{P}$ and let

$$T = (\mathit{Var}(Q) \times A) - \mathit{Remove}$$

be the complement of that relation. If there is a variable x such that for no node
v, $(x, v) \in T$, no arc-consistent pre-valuation of Q on $\mathcal{A}$ exists and Q is not satisfied.
Otherwise, the pre-valuation defined by

$$\Theta(x) \mapsto \{ v \mid (x,v) \in T \},$$

for each x, is obviously arc-consistent and contains all arc-consistent pre-valuations
of Q on $\mathcal{A}$.

Program $\mathcal{P}$ can be computed and solved (e.g. using Minoux' algorithm [Minoux
1988]), and the solution complemented, in time linear in the size of the program,
which is $O(\|\mathcal{A}\| \cdot |Q|)$. □

Actually, this algorithm computes the unique subset-maximal arc-consistent pre- 
valuation of Q on A. 
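The computation of the subset-maximal arc-consistent pre-valuation can be sketched as a simple fixpoint iteration that repeatedly removes unsupported candidate nodes. Note that this is a naive propagation loop and not the linear-time Horn-SAT encoding used in the proof above, so it does not achieve the stated time bound; the function name, the input encoding, and the example structure are all assumptions made for illustration.

```python
def max_arc_consistent(domain, unary, binary, unary_atoms, binary_atoms):
    """Return the subset-maximal arc-consistent pre-valuation, or None if none exists.

    unary:        dict label -> set of nodes carrying that label
    binary:       dict relation name -> set of (node, node) pairs
    unary_atoms:  list of (label, variable) atoms of the query
    binary_atoms: list of (relation name, variable, variable) atoms of the query
    """
    variables = {x for _, x in unary_atoms} | {v for _, x, y in binary_atoms for v in (x, y)}
    theta = {x: set(domain) for x in variables}
    for lab, x in unary_atoms:
        theta[x] &= unary[lab]
    changed = True
    while changed:  # iterate until no candidate can be removed any more
        changed = False
        for name, x, y in binary_atoms:
            rel = binary[name]
            keep_x = {v for v in theta[x] if any((v, w) in rel for w in theta[y])}
            if keep_x != theta[x]:
                theta[x], changed = keep_x, True
            keep_y = {w for w in theta[y] if any((v, w) in rel for v in theta[x])}
            if keep_y != theta[y]:
                theta[y], changed = keep_y, True
    # A pre-valuation must assign a nonempty set to every variable.
    return None if any(not s for s in theta.values()) else theta

# Hypothetical structure: the numbers 1..4 with a unary relation Even and a binary relation Lt.
domain = range(1, 5)
unary = {"Even": {2, 4}}
binary = {"Lt": {(a, b) for a in domain for b in domain if a < b}}

# Query: Even(x), Lt(x, y), Lt(y, z)
print(max_arc_consistent(domain, unary, binary,
                         [("Even", "x")], [("Lt", "x", "y"), ("Lt", "y", "z")]))
```

For the cyclic query Lt(x, y), Lt(y, x) the same function returns None, reflecting that no arc-consistent pre-valuation exists.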

A valuation is called consistent if it satisfies the query. In this case, for a
Boolean query, we also say that the structure is a model of the query and the
valuation a satisfaction. Obviously, a valuation $\theta$ is consistent if and only if the
pre-valuation $\Theta$ defined by $\Theta(x) \mapsto \{\theta(x)\}$ is arc-consistent. Let < be a total order
on $A = |\mathcal{A}|$ and $\Theta$ be a pre-valuation. Then the valuation $\theta$ with $\theta(x) \mapsto w$ iff w is
the smallest node in $\Theta(x)$ w.r.t. < is called the minimum valuation w.r.t. < in $\Theta$.

Definition 3.2. Let $\mathcal{A}$ be a relational structure, R a binary relation in $\mathcal{A}$, and
< a total order on $A = |\mathcal{A}|$. Then, R is said to have the X-property w.r.t. < iff for
all $n_0, n_1, n_2, n_3 \in A$ such that $n_0 < n_1$ and $n_2 < n_3$,

$$R(n_1, n_2) \wedge R(n_0, n_3) \Rightarrow R(n_0, n_2).$$
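Definition 3.2 translates directly into a brute-force test over all pairs of tuples of R. The sketch below is such a test; the function name `has_x_property` and the two example relations are assumptions chosen for illustration, and the quadratic enumeration is meant for small examples only.

```python
from itertools import product

def has_x_property(relation, order):
    """Check Definition 3.2: for n0 < n1 and n2 < n3, R(n1,n2) and R(n0,n3) imply R(n0,n2).

    relation: set of (node, node) pairs; order: list of all nodes, smallest first.
    """
    rank = {v: i for i, v in enumerate(order)}
    return all((n0, n2) in relation
               for (n1, n2), (n0, n3) in product(relation, repeat=2)
               if rank[n0] < rank[n1] and rank[n2] < rank[n3])

order = [1, 2, 3, 4]
less_than = {(a, b) for a in order for b in order if a < b}
crossing = {(2, 3), (1, 4)}  # two crossing arcs without the required "underbar" arc (1, 3)

print(has_x_property(less_than, order))  # True
print(has_x_property(crossing, order))   # False
```

Adding the pair (1, 3) to the second relation supplies the missing arc, after which the test succeeds.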

Figure 2 illustrates why the property is called X (read as "X-underbar"). Let
us consider two vertical bars both representing the order < bottom-up (i.e., with
the smallest value at the bottom). Let each edge (u, v) in R be represented by an
arc from node u on the left bar to node v on the right bar. Then, whenever there
are two crossing arcs (u, v) and (u', v') in this diagram, there must be an arc
(min(u, u'), min(v, v')), the "underbar", in the diagram as well.

Fig. 2. The X-property. Graph (a) and its illustration by arcs between two bars (b). For crossing
arcs R(u, v) and R(u', v'), say u < u' and v' < v, there must be an arc R(u, v').

Remark 3.3. The X-property⁷ was introduced in [Gutjahr et al. 1992], where
it was shown that the H-coloring problem (or equivalently the conjunctive query
evaluation problem) on graphs H with the X-property is polynomial-time solvable
(see also [Hell and Nesetril 2004]). In the remainder of this section, we rephrase
this result as a tool for efficiently evaluating conjunctive queries.

Let $\mathcal{A}$ be a structure of unary and binary relations and let < be a total order on
$|\mathcal{A}|$. Structure $\mathcal{A}$ is said to have the X-property w.r.t. < if all binary relations R in
$\mathcal{A}$ have the X-property w.r.t. <.

Lemma 3.4. Let $\mathcal{A}$ be a structure with the X-property w.r.t. < and let $\Theta$ be an
arc-consistent pre-valuation on $\mathcal{A}$ for a given conjunctive query over the relations
of $\mathcal{A}$. Then, the minimum valuation in $\Theta$ w.r.t. < is consistent.

Proof. Let $\theta$ denote the minimum valuation in $\Theta$ w.r.t. <. To prove $\theta$
consistent, we only need to show the following: If R(x, y) is any binary atom of Q with
variables x, y then R(x, y) holds under assignment $\theta$, i.e. $R(\theta(x), \theta(y))$ is true in $\mathcal{A}$.

Let $\theta(x) = n_0$ and $\theta(y) = n_2$. Since $\Theta$ is arc-consistent there exists a node
$n_1 \in \Theta(x)$ such that $R(n_1, n_2)$ and a node $n_3 \in \Theta(y)$ such that $R(n_0, n_3)$. If $n_0 = n_1$
or $n_2 = n_3$ then $R(\theta(x), \theta(y)) = R(n_0, n_2)$ is true and we are done. Otherwise,
since $\theta$ is a minimum valuation we have $n_0 < n_1$ (because $n_0 = \theta(x) = \min \Theta(x)$,
$n_1 \in \Theta(x)$, and $n_0 \neq n_1$) and $n_2 < n_3$ (because $n_2 = \theta(y) = \min \Theta(y)$, $n_3 \in \Theta(y)$,
and $n_2 \neq n_3$). Then it follows from Definition 3.2 that $R(n_0, n_2)$. □
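The statement of Lemma 3.4 can be illustrated on a tiny example: from an arc-consistent pre-valuation over a relation with the X-property, picking the smallest candidate for every variable yields a consistent valuation, whereas an arbitrary choice of candidates need not. The structure (the numbers 1 to 4 with the strict order as the only binary relation), the query, and all function names below are assumptions made for this sketch.

```python
def is_arc_consistent(theta, rel, atoms):
    """Every candidate of x has a partner among the candidates of y, and vice versa."""
    return all(
        all(any((v, w) in rel for w in theta[y]) for v in theta[x]) and
        all(any((v, w) in rel for v in theta[x]) for w in theta[y])
        for x, y in atoms)

def is_consistent(valuation, rel, atoms):
    """The valuation satisfies every binary atom R(x, y) of the query."""
    return all((valuation[x], valuation[y]) in rel for x, y in atoms)

def minimum_valuation(theta, order):
    """Map each variable to the smallest node of its candidate set w.r.t. the given order."""
    rank = {v: i for i, v in enumerate(order)}
    return {x: min(vs, key=rank.__getitem__) for x, vs in theta.items()}

order = [1, 2, 3, 4]
lt = {(a, b) for a in order for b in order if a < b}  # has the X-property w.r.t. the natural order
atoms = [("x", "y"), ("y", "z")]                      # query: Lt(x, y), Lt(y, z)
theta = {"x": {1, 2}, "y": {2, 3}, "z": {3, 4}}        # an arc-consistent pre-valuation

theta_min = minimum_valuation(theta, order)
print(is_arc_consistent(theta, lt, atoms))                       # True
print(theta_min, is_consistent(theta_min, lt, atoms))            # {'x': 1, 'y': 2, 'z': 3} True
print(is_consistent({"x": 2, "y": 2, "z": 3}, lt, atoms))         # False: arbitrary picks can fail
```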

Clearly, if no arc-consistent pre-valuation of Q on $\mathcal{A}$ exists, there is no consistent
valuation for Q on $\mathcal{A}$.

Theorem 3.5. Given a structure $\mathcal{A}$ with the X-property w.r.t. < and a Boolean
conjunctive query Q over $\mathcal{A}$, Q can be evaluated on $\mathcal{A}$ in time $O(\|\mathcal{A}\| \cdot |Q|)$.

Proof. By Lemma 3.4, all we need to do to check whether a Boolean query Q
is satisfied is to try to compute the subset-maximal arc-consistent pre-valuation $\Theta$
with respect to Q. By Proposition 3.1, this can be done in time $O(\|\mathcal{A}\| \cdot |Q|)$. If it
exists, Q returns true; otherwise, Q returns false. □

It follows that checking whether a given tuple $(a_1, \dots, a_k)$ is in the result of
a k-ary conjunctive query on structures with the X-property w.r.t. some order
can be decided in time $O(\|\mathcal{A}\| \cdot |Q|)$ as well. All we need to do is to add (new)
singleton unary relations $X_1 = \{a_1\}, \dots, X_k = \{a_k\}$ to $\mathcal{A}$ and to rewrite the query
$Q(x_1, \dots, x_k) \leftarrow \Phi(x_1, \dots, x_k)$ into the Boolean query
$Q \leftarrow \Phi(x_1, \dots, x_k) \wedge X_1(x_1) \wedge \dots \wedge X_k(x_k)$. A k-ary conjunctive query Q over
$\mathcal{A}$ with $A = |\mathcal{A}|$ can thus be evaluated on $\mathcal{A}$ in time $O(|A|^k \cdot \|\mathcal{A}\| \cdot |Q|)$.

⁷ In [Gottlob et al. 2004], this property was called hemichordality.

For relations that are subsets of the given total order $\leq$ (the reflexive closure of
<), a slightly stronger condition for the X-property w.r.t. < can be given.

Lemma 3.6. Let $\mathcal{A}$ be a structure, < a total order on $A = |\mathcal{A}|$, and R a binary
relation of $\mathcal{A}$ such that $R \subseteq\ \leq$. Then, R has the X-property w.r.t. < iff for all
$n_0, n_1, n_2, n_3 \in A$ such that $n_0 < n_1 \leq n_2 < n_3$,

$$R(n_1, n_2) \wedge R(n_0, n_3) \Rightarrow R(n_0, n_2).$$

Proof. Obviously, if R has the X-property w.r.t. <, then the condition of
Definition 3.2 implies that the condition of our lemma holds. Conversely, since for all
$n_1, n_2$, $R(n_1, n_2) \Rightarrow n_1 \leq n_2$, by our lemma, for all $n_0, n_1, n_2, n_3 \in A$ such that
$n_0 < n_1$ and $n_2 < n_3$, $R(n_1, n_2) \wedge R(n_0, n_3) \Rightarrow R(n_0, n_2)$. □

A symmetric version of Lemma 3.6 holds for relations $R \subseteq\ \geq$.

Lemma 3.7. Let $\mathcal{A}$ be a structure, < a total order on $A = |\mathcal{A}|$, and R a binary
relation of $\mathcal{A}$ such that $R \subseteq\ \geq$. Then, R has the X-property w.r.t. < iff for all
$n_0, n_1, n_2, n_3 \in A$ such that $n_0 < n_1 \leq n_2 < n_3$,

$$R(n_2, n_1) \wedge R(n_3, n_0) \Rightarrow R(n_2, n_0).$$

Proof. Let $R' = R^{-1}$. By Lemma 3.6, R' has the X-property w.r.t. < precisely
if for all $n_0, n_1, n_2, n_3 \in A$, $n_0 < n_1 \leq n_2 < n_3 \wedge R'(n_1, n_2) \wedge R'(n_0, n_3) \Rightarrow R'(n_0, n_2)$.
Thus, R has the X-property w.r.t. < iff the condition of our lemma holds. □

4. POLYNOMIAL-TIME RESULTS 

The results of Section 3 provide us with a simple technique for proving polynomial- 
time complexity results for conjunctive queries over trees. Indeed, there is a wealth 
of inclusions of axis relations in the total orders introduced in Section 2: 

(1) all the axes in Ax are subsets of the pre-order $<_{pre}$,

(2) $\mathit{Child}^{-1}$, $(\mathit{Child}^+)^{-1}$, $(\mathit{Child}^*)^{-1}$, Following, NextSibling, $\mathit{NextSibling}^+$, and
$\mathit{NextSibling}^*$ are subsets of the post-order $<_{post}$, and

(3) Child, $\mathit{Child}^+$, $\mathit{Child}^*$, NextSibling, $\mathit{NextSibling}^+$, and $\mathit{NextSibling}^*$ are subsets
of the order $<_{bflr}$.

Using Lemma 3.6, it is straightforward to show that

Theorem 4.1. The axes

(1) $\mathit{Child}^+$ and $\mathit{Child}^*$ have the X-property w.r.t. $<_{pre}$,

(2) Following has the X-property w.r.t. $<_{post}$, and

(3) Child, NextSibling, $\mathit{NextSibling}^*$, and $\mathit{NextSibling}^+$ have the X-property w.r.t.
$<_{bflr}$.
Proof. All proof arguments use Lemma 3.6.

We first show that $\mathit{Child}^*$ has the X-property w.r.t. $<_{pre}$. (The proof for $\mathit{Child}^+$
is similar.) Consider the nodes $n_0, \dots, n_3$ such that $n_0 <_{pre} n_1 \leq_{pre} n_2 <_{pre} n_3$,
$\mathit{Child}^*(n_0, n_3)$, and $\mathit{Child}^*(n_1, n_2)$. It is simple to see that $<_{pre}$ is the disjoint
union of $\mathit{Child}^+$ and Following. Therefore, either $\mathit{Child}^+(n_0, n_1)$, which implies
$\mathit{Child}^*(n_0, n_2)$, or $\mathit{Following}(n_0, n_1)$. The latter case would yield $n_3 <_{pre} n_1$, a
contradiction.

Next, we show that Following has the X-property w.r.t. $<_{post}$. Assume that

$$n_0 <_{post} n_1 <_{post} n_2 <_{post} n_3$$

and $\mathit{Following}(n_1, n_2)$, $\mathit{Following}(n_0, n_3)$. Clearly, the relation $<_{post}$ is the
disjoint union of Following and the inverse of $\mathit{Child}^+$. Since $n_0 <_{post} n_1$ is true,
either $\mathit{Child}^+(n_1, n_0)$ or $\mathit{Following}(n_0, n_1)$ must hold. In both cases it follows that
$\mathit{Following}(n_0, n_2)$. Thus, Following has the X-property w.r.t. $<_{post}$.

The fact that Child has the X-property w.r.t. $<_{bflr}$ follows vacuously from the
characterization of Lemma 3.6: Assume that $n_0 <_{bflr} n_1 \leq_{bflr} n_2 <_{bflr} n_3$ and that
$\mathit{Child}(n_1, n_2)$ (thus $n_0 <_{bflr} n_1 <_{bflr} n_2 <_{bflr} n_3$) and $\mathit{Child}(n_0, n_3)$. Because of
$\mathit{Child}(n_0, n_3)$, the node $n_1$ is at most one level below $n_0$ in the tree. There are
two cases, (1) $\mathit{Following}(n_0, n_1)$ and $n_0, n_1$ are on the same level in the tree or (2)
$\mathit{Following}(n_1, n_3)$ and $n_1, n_3$ are on the same level in the tree. In case (1), since
$n_3$ is a child of $n_0$ and $n_2$ is a child of $n_1$, $n_3 <_{bflr} n_2$, contradiction. In case (2),
since $n_2$ is a child of $n_1$, $n_2$ is one level below $n_3$ in the tree and thus $n_3 <_{bflr} n_2$,
contradiction.

It is easy to verify that NextSibling, $\mathit{NextSibling}^*$, and $\mathit{NextSibling}^+$ have the
X-property w.r.t. $<_{bflr}$ using Lemma 3.6. □
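The claims of Theorem 4.1 can be sanity-checked by brute force on a concrete tree. The sketch below builds the axes and the three orders for a small, hypothetical example tree and tests Definition 3.2 directly; a check on a single tree is of course no substitute for the proof, and all names in the code are assumptions introduced here.

```python
from collections import deque
from itertools import product

# Hypothetical example tree: node -> ordered list of children.
children = {"r": ["a", "d"], "a": ["b", "c"], "b": [], "c": [], "d": ["e"], "e": []}
nodes = list(children)

def plus(rel):
    """Transitive closure (naive fixpoint iteration)."""
    closure = set(rel)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c} - closure
        if not new:
            return closure
        closure |= new

def star(rel):
    """Reflexive and transitive closure."""
    return plus(rel) | {(v, v) for v in nodes}

child = {(p, c) for p, cs in children.items() for c in cs}
next_sibling = {(cs[i], cs[i + 1]) for cs in children.values() for i in range(len(cs) - 1)}
child_star = star(child)
following = {(x, y)
             for (z1, z2) in plus(next_sibling)
             for (a, x) in child_star if a == z1
             for (b, y) in child_star if b == z2}

def pre(v):
    return [v] + [u for c in children[v] for u in pre(c)]

def post(v):
    return [u for c in children[v] for u in post(c)] + [v]

def bflr(root):
    order, queue = [], deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        queue.extend(children[v])
    return order

def has_x_property(rel, order):
    """Definition 3.2, checked over all pairs of tuples of rel."""
    rank = {v: i for i, v in enumerate(order)}
    return all((n0, n2) in rel
               for (n1, n2), (n0, n3) in product(rel, repeat=2)
               if rank[n0] < rank[n1] and rank[n2] < rank[n3])

checks = {
    "Child+ / pre": has_x_property(plus(child), pre("r")),
    "Child* / pre": has_x_property(child_star, pre("r")),
    "Following / post": has_x_property(following, post("r")),
    "Child / bflr": has_x_property(child, bflr("r")),
    "NextSibling / bflr": has_x_property(next_sibling, bflr("r")),
    "NextSibling* / bflr": has_x_property(star(next_sibling), bflr("r")),
    "NextSibling+ / bflr": has_x_property(plus(next_sibling), bflr("r")),
    "Following / pre": has_x_property(following, pre("r")),  # expected to fail
}
for name, ok in checks.items():
    print(name, ok)
```

On this tree all seven combinations named in the theorem pass, while Following fails w.r.t. the pre-order: with the chain a, b, c, d in pre-order, Following(b, c) and Following(a, d) hold but Following(a, c) does not, since c is a child of a.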

Now, it follows immediately from Theorem 3.5 that

Corollary 4.2 [Gottlob and Koch 2002]. Conjunctive queries over the signature
$\tau_1 := \langle (\mathit{Label}_a)_{a \in \Sigma}, \mathit{Child}^+, \mathit{Child}^* \rangle$
are in polynomial time w.r.t. combined complexity.

Corollary 4.3. Conjunctive queries over the signature
$\tau_2 := \langle (\mathit{Label}_a)_{a \in \Sigma}, \mathit{Following} \rangle$
are in polynomial time w.r.t. combined complexity.

Corollary 4.4. Conjunctive queries over the signature
$\tau_3 := \langle (\mathit{Label}_a)_{a \in \Sigma}, \mathit{Child}, \mathit{NextSibling}, \mathit{NextSibling}^*, \mathit{NextSibling}^+ \rangle$
are in polynomial time w.r.t. combined complexity.

Example 4.5. The remaining inclusions between axis relations and total orders
introduced at the beginning of this section do not extend to the X-property. For
example, Figure 3 (a) illustrates that Following does not satisfy the condition of
Lemma 3.6 w.r.t. the pre-order $<_{pre}$. (While $2 <_{pre} 3 <_{pre} 4 <_{pre} 6$, and
$\mathit{Following}(2, 6)$ and $\mathit{Following}(3, 4)$ hold, $\mathit{Following}(2, 4)$ does not hold.)

Fig. 3. (a) Following does not have the X-property w.r.t. $<_{pre}$; (b) $\mathit{Descendant}^{-1}$ and
$\mathit{Descendant\text{-}or\text{-}self}^{-1}$ do not have the X-property w.r.t. $<_{post}$.

Figure 3 (b) shows that $\mathit{Descendant}^{-1}$ and $\mathit{Descendant\text{-}or\text{-}self}^{-1}$ do not satisfy
the condition of Lemma 3.6 w.r.t. the post-order $<_{post}$. (While $1 <_{post} 3 <_{post} 4 <_{post} 5$, and
$\mathit{Descendant}^{-1}(1, 5)$ and $\mathit{Descendant}^{-1}(3, 4)$ hold, $\mathit{Descendant}^{-1}(1, 4)$ does not hold.)

For a total order <, let $\mathit{Succ}_< := \{(x, y) \mid x < y \wedge \nexists z\; x < z < y\}$. It is trivial
to verify that $\mathit{Succ}_<$, <, and $\leq$ have the X-property w.r.t. <. Thus, we may for
instance add the relations $<_{pre}$ (document order) and $\mathit{Succ}_{<_{pre}}$ ("next node in
document order") to $\tau_1$, while retaining polynomial-time combined complexity.

5. NP-HARDNESS RESULTS 

In this section, we study the complexity of the conjunctive query evaluation problem 
for the remaining sets of axis relations. For all cases for which our techniques based 
on the X-property do not yield a polynomial-time complexity result, we are able 
to prove NP-hardness. All NP-hardness results hold already for query complexity, 
i.e., in a setting where the data tree, and thus in particular the labeling alphabet, 
is fixed and only the query is assumed variable. 

All reductions are from one-in-three 3SAT, which is the following NP-complete
problem: Given a set U of variables and a collection $\mathcal{C}$ of clauses over U such that each
clause $C \in \mathcal{C}$ has |C| = 3, is there a truth assignment for U such that each clause
in $\mathcal{C}$ has exactly one true literal? 1-in-3 3SAT remains NP-complete if all clauses
contain only positive literals [Schaefer 1978].
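For readers less familiar with the source problem of the reductions, the following sketch decides positive 1-in-3 3SAT by exhaustive search. It merely illustrates the problem statement; the function name and the example instances are assumptions, and the exponential enumeration is not related to the reductions of this section.

```python
from itertools import product

def one_in_three_sat(variables, clauses):
    """Return an assignment making exactly one literal per clause true, or None.

    variables: list of variable names; clauses: list of 3-tuples of (positive) variables.
    """
    for bits in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if all(sum(assignment[v] for v in clause) == 1 for clause in clauses):
            return assignment
    return None

# A satisfiable instance: choose exactly one variable in each clause.
sat = one_in_three_sat(["a", "b", "c", "d", "e"], [("a", "b", "c"), ("a", "d", "e")])
print(sat)

# An unsatisfiable instance: every variable occurs in three of the four clauses, so the
# number of true literal occurrences is a multiple of 3 and can never equal 4.
unsat = one_in_three_sat(["a", "b", "c", "d"],
                         [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("b", "c", "d")])
print(unsat)  # None
```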

Below, we will use shortcuts of the form $\chi^k(x, y)$, where $\chi$ is an axis, in queries to
denote chains of k $\chi$-atoms leading from variable x to y. For example, $\mathit{Child}^2(x, y)$
is a shortcut for Child(x, z), Child(z, y), where z is a new variable.
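The shortcut notation can be expanded mechanically into a chain of atoms with fresh intermediate variables. The helper below is a minimal sketch of that expansion; the function name `expand_chain`, the atom encoding as triples, and the naming scheme for fresh variables are assumptions made for illustration.

```python
from itertools import count

def expand_chain(axis, k, x, y, fresh):
    """Expand the shortcut axis^k(x, y) into k atoms linked by k - 1 fresh variables.

    fresh: an iterator yielding variable names that do not occur elsewhere in the query.
    """
    if k < 1:
        raise ValueError("a chain needs at least one atom")
    names = [x] + [next(fresh) for _ in range(k - 1)] + [y]
    return [(axis, names[i], names[i + 1]) for i in range(k)]

fresh = (f"_z{i}" for i in count(1))
print(expand_chain("Child", 2, "x", "y", fresh))
# [('Child', 'x', '_z1'), ('Child', '_z1', 'y')]
```

Sharing one `fresh` iterator across all expansions of a query keeps the introduced variables distinct from each other.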

The first theorem strengthens a known result for combined complexity [Meuss 
et al. 2001] to query complexity. 

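Theorem 5.1. Conjunctive queries over the signatures
$\tau_4 := \langle (\mathit{Label}_a)_{a \in \Sigma}, \mathit{Child}, \mathit{Child}^+ \rangle$
$\tau_5 := \langle (\mathit{Label}_a)_{a \in \Sigma}, \mathit{Child}, \mathit{Child}^* \rangle$
are NP-complete w.r.t. query complexity.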

Proof. Here, as in all other proofs of this section, we only need to show
NP-hardness. Let $C_1, \dots, C_m$ be a 1-in-3 3SAT instance with positive literals only. We
assume that each $C_i$ is an ordered sequence of three positive literals. We may assume
without loss of generality that no clause contains a particular literal more than
once. We reduce this instance to one of the Boolean conjunctive query evaluation
problem for $\tau_4$ ($\tau_5$).

Fig. 4. Data tree of the proof of Theorem 5.1.

The fixed data tree over alphabet $\{X, Y, L_1, L_2, L_3\}$ is shown in Figure 4.

For the query, we introduce variables $x_i, y_i$ for $1 \leq i \leq m$ and in addition a
variable $z_{k,l,i,j}$ whenever the k-th literal of $C_i$ coincides with the l-th literal of $C_j$
($1 \leq i \leq m$, $1 \leq j \leq m$, $i \neq j$, $1 \leq k, l \leq 3$).

The Boolean query consists of the following atoms:

— for $1 \leq i \leq m$,

$X(x_i),\ Y(y_i),\ \mathit{Child}^3(x_i, y_i),$

— for each variable $z_{k,l,i,j}$,

$L_k(z_{k,l,i,j}),\ \mathit{Child}^{\circ}(y_i, z_{k,l,i,j}),\ \mathit{Child}^{8+k-l}(x_j, z_{k,l,i,j})$
where $\circ$ is "+" on signature $\tau_4$ and "*" on $\tau_5$.

"⇒". To prove correctness of the reduction, we first show that given any solution
mapping $\sigma : \{1, \dots, m\} \to \{1, 2, 3\}$ of $C_1, \dots, C_m$ (i.e., $\sigma(i) = k'$ iff $\sigma$ selects the
k'-th literal from $C_i$) we can define a satisfaction $\theta$ of the query. We first define a
valuation $\theta$ of our query and then show that all query atoms are satisfied. We set

— $\theta(x_i) := v_{\sigma(i)}$ for $1 \leq i \leq m$,

— $\theta(y_i) := w_{\sigma(i),\sigma(i)}$ for $1 \leq i \leq m$, and

— for each variable $z_{k,l,i,j}$, $\theta(z_{k,l,i,j}) := w_{\sigma(i),\,5+k-l+\sigma(j)}$.

We now prove that $\theta$ is a satisfaction of the query. Our choice of $\theta$ implies that
the variables $x_i$ and $y_i$ are mapped to nodes with labels X and Y, respectively.
Furthermore, $\theta(y_i) = w_{\sigma(i),\sigma(i)}$ can be reached from $\theta(x_i) = v_{\sigma(i)}$ with three
child-steps. For any variable of the form $z_{k,l,i,j}$, $\theta(z_{k,l,i,j}) = w_{\sigma(i),\,5+k-l+\sigma(j)}$ is always
a $\mathit{Child}^{\circ}$ of $w_{\sigma(i),\sigma(i)}$. If $\sigma(i) \neq k$, then $\theta(z_{k,l,i,j}) = w_{\sigma(i),\,5+k-l+\sigma(j)}$ has label $L_k$
because $4 \leq 5 + k - l + \sigma(j) \leq 10$ and the nodes $w_{\sigma(i),4}, \dots, w_{\sigma(i),10}$ all have (at
least) the two labels $L_{k'}$ for which $\sigma(i) \neq k'$. If $\sigma(i) = k$, then $\sigma(j) = l$. By
going $8 + k - l$ steps downward from $v_{\sigma(j)}$, passing through $w_{k,k}$, we reach node
$w_{k,5+k}$, which has label $L_k$. Since $\theta(z_{k,l,i,j}) = w_{\sigma(i),\,5+k-l+\sigma(j)} = w_{k,5+k}$, the query
atoms $\mathit{Child}^{8+k-l}(x_j, z_{k,l,i,j})$ are satisfied. Therefore, $\theta$ is indeed a satisfaction of
our query.

"⇐". To finish the proof we show that from any satisfaction $\theta$ of the query
we obtain a corresponding solution for the 1-in-3 3SAT instance $C_1, \dots, C_m$. If
$\theta(x_i) = v_k$, we interpret this as the k-th literal of clause $C_i$ being chosen to be true.
Obviously, under any valuation of the query, we select precisely one literal from
each clause $C_i$. We have to verify that if a literal L occurs in two clauses $C_i$ and
$C_j$ and we select L in $C_i$, we also select L in $C_j$. Let L be the k-th literal of $C_i$ and
the l-th literal of $C_j$, and let $\theta(x_i) = v_k$ (i.e., L is selected in $C_i$). Then
$\theta(z_{k,l,i,j}) = w_{k,5+k}$ because that is the only node below $\theta(y_i) = w_{k,k}$ that has label $L_k$.
The query contains the atom $\mathit{Child}^{8+k-l}(x_j, z_{k,l,i,j})$ for variable $z_{k,l,i,j}$. From node
$w_{k,5+k}$, by $8 + k - l$ upward steps we arrive at node $v_l$. Hence $\theta(x_j) = v_l$, and we
select L from clause $C_j$.

Some nodes in the data tree carry multiple labels. However, since the Child axis
is available in both $\tau_4$ and $\tau_5$, multiple labels can be eliminated by pushing them
down to new children in the data tree and modifying the queries accordingly. □

Theorem 5.2. Conjunctive queries over the signature
$\tau_6 := \langle (\mathit{Label}_a)_{a \in \Sigma}, \mathit{Child}, \mathit{Following} \rangle$
are NP-complete w.r.t. query complexity.

Fig. 5. Clause gadget of the proof of Theorem 5.2.
Proof. Figure 5 shows (a) the fragment of a data tree and (b) a query over the
labeling alphabet Σ = {A, B, C, L_1, L_2, L_3}.

Observe that the labels L_1, L_2, and L_3 occur only once each in Figure 5 (b). We
will refer to the nodes (= query variables) labeled L_1, L_2, and L_3 by v_1, v_2, and v_3,
respectively. For the following discussion, we have annotated some of the nodes of
the data tree with numbers (1-7). Below, node 1 (resp. 3, 6) is called the topmost
position of variable v_1 (resp. v_2, v_3). We start with two simple observations.

(1) In any satisfaction θ of the query on the data tree, at most one of the variables
v_1, v_2, and v_3 is mapped to its topmost position under θ. In fact, assume, e.g.,
that θ(v_1) = 1. From node 1, node 3 (resp. 6) cannot be reached by a sequence
of 2 (resp. 7) Following-steps. Hence we have θ(v_2) ≠ 3 and θ(v_3) ≠ 6.

(2) In any solution θ of the problem, at least one of the variables v_1, v_2, and v_3
is mapped to its topmost position under θ. In fact, assume that θ(v_1) = 2 and
θ(v_2) ≠ 3. The atoms in the query (in particular, on the variables corresponding
to nodes on the bottom of the query graph) require that θ(v_2) ≠ 4. Hence
θ(v_2) = 5 is the only remaining possibility. But now the query requires that
θ(v_3) ≠ 7. Hence θ(v_3) = 6.

Thus, precisely the three partial assignments

(a) θ(v_1) := 1, θ(v_2) := 4, θ(v_3) := 7
(b) θ(v_1) := 2, θ(v_2) := 3, θ(v_3) := 7
(c) θ(v_1) := 2, θ(v_2) := 5, θ(v_3) := 6

can be extended to a satisfaction of the query. Precisely one of the variables v_1,
v_2, and v_3 is mapped to its topmost position under each of the above assignments.
Conversely, for each variable there is a satisfying assignment in which it takes its
topmost position.
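
The "exactly one topmost position" property of the three partial assignments can be checked mechanically. The following is only a small sanity check, not part of the construction; the dictionary encoding and the names TOPMOST, ASSIGNMENTS and selected are our own illustrative choices, while the node numbers are those used in the proof.

```python
# Topmost positions of v1, v2, v3 (nodes 1, 3, 6 of the data tree, as stated in the proof).
TOPMOST = {"v1": 1, "v2": 3, "v3": 6}

# The three partial assignments (a), (b), (c) from the proof.
ASSIGNMENTS = {
    "a": {"v1": 1, "v2": 4, "v3": 7},
    "b": {"v1": 2, "v2": 3, "v3": 7},
    "c": {"v1": 2, "v2": 5, "v3": 6},
}


def selected(assignment):
    """Return the variables that sit at their topmost position."""
    return [var for var, node in assignment.items() if node == TOPMOST[var]]


# Which variable (i.e., which literal of the clause) each assignment selects.
SELECTION = {name: selected(a) for name, a in ASSIGNMENTS.items()}
```

Each assignment selects exactly one variable, and each variable is selected by exactly one assignment, which is what the 1-in-3 encoding needs.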

Given a clause C, an ordered list of three positive literals, we interpret a satis- 
faction 9 in which variable Vk is mapped to its topmost position as the selection of 
the k-th literal from C to be true. The encoding described above thus assures that 
exactly one variable of clause C is selected and becomes true. 

Now consider a 1-in-3 3SAT problem instance over positive literals with clauses
C_1, ..., C_m. We encode such an instance as a conjunctive query over τ_6 and a fixed
data tree over labeling alphabet Σ = {A, B, C, L_1, L_2, L_3}. This tree consists of
two copies of the tree of Figure 5 (a) under a common root, i.e., a root node whose
two subtrees are both copies of T, where T denotes the tree of Figure 5 (a).

The query is obtained as follows. Each clause C_i is represented using two copies
of the query gadget of Figure 5 (b) (a "left" copy Q_i and a "right" copy Q'_i). We
wire the two sets of subqueries Q_1, ..., Q_m, Q'_1, ..., Q'_m as follows.

Consider first the integer function NAND(k, l) defined by Table II. We can
enforce that two variables, x and y, labeled L_k and L_l in their respective subqueries,
cannot both match the topmost node labeled L_k resp. L_l in the left, respectively right,
part of the data tree by adding an atom of the form Following^{NAND(k,l)}(x, y) to the
query.

k\l |  1 |  2 |  3
----+----+----+----
 1  | 10 | 13 | 18
 2  |  5 |  8 | 13
 3  |  2 |  5 | 10

Table II. The function NAND(k, l).
For each pair of clauses Cj, Cj, variable x such that Qi (resp., Q^) contains the 



These query atoms make sure that if a literal is chosen to be true in one clause,
it must be selected to be true in all other clauses as well. In the case that i = j, the
idea is to make sure that both copies of the query gadget of each clause, Q_i and Q'_i,
make the same choice of selected literal. The case that i ≠ j models the interaction
between distinct clauses. Thus our query assures that each literal is assigned the
same truth value in all clauses.

Using two copies of the query gadget for each clause and two copies of the tree
gadget of Figure 5 (a) in the data tree is necessary, as we cannot use Following-atoms
to make sure that two variables are not both assigned their topmost positions
in the data tree (corresponding to "true") if the data tree consists just of the tree
of Figure 5 (a) and these two topmost positions in the data tree coincide.

This concludes the construction, which can be easily implemented to run in
logarithmic space. It is not difficult to verify that the fixed data tree satisfies the
query precisely if the 1-in-3 3SAT instance is satisfiable. □

Theorem 5.3. Conjunctive queries over the signatures

τ_7 := ((Label_a)_{a∈Σ}, Child^+, Following),
τ_8 := ((Label_a)_{a∈Σ}, Child*, Following)

are NP-complete w.r.t. query complexity.

Proof. The same encoding as in the previous proof can be used, with the only
difference that Child* resp. Child^+ is used instead of Child in the query. In fact, if
the topmost position for v_1 (resp. v_2, v_3) is chosen, there are two possible matches
for "A" (resp. three for "B" and two for "C"). This has no impact on the constraints
across clauses or the constraints that at most one variable of each clause is assigned
to its topmost position. To make sure that at least one variable of each clause is
assigned its topmost position, the constraints of the query assure that either "A",
"B", or "C" are assigned to the correspondingly labeled node at depth two in the
subtree of the clause (rather than depth three). □

Fig. 6. Data tree of proof of Theorem 5.6.

Since Following can be defined by a conjunctive query over Child* and NextSibling^+
(see Equation (1) in Section 2), we obtain the following.

Corollary 5.4. Conjunctive queries over the signature

τ_9 := ((Label_a)_{a∈Σ}, Child*, NextSibling^+)

are NP-complete w.r.t. query complexity.

Theorem 5.5. Conjunctive queries over the signature

τ_10 := ((Label_a)_{a∈Σ}, Child*, NextSibling)

are NP-complete w.r.t. query complexity.

Proof. If we replace Following by

Following'(x, y) := ∃z_1 ∃z_2 Child*(z_1, x) ∧ NextSibling(z_1, z_2) ∧ Child*(z_2, y),

we can reuse the construction of the proof of Theorem 5.2 (in the modified form of
the proof of Theorem 5.3). □

Theorem 5.6. Conjunctive queries over the signature

τ_11 := ((Label_a)_{a∈Σ}, Child*, NextSibling*)

are NP-complete w.r.t. query complexity.

Proof. The proof basically uses the same argument as Corollary 5.4. However,
to deal with NextSibling* rather than NextSibling^+, we need a way to ensure that
NextSibling* moves at least one step to the right. We thus replace each occurrence
of Following in the construction of the proof of Theorem 5.2 by

Following'(x, y) := ∃z_1 ∃z_2 ∃z_3 Child*(z_1, x) ∧ NextSibling*(z_1, z_2) ∧
                    H(z_2) ∧ NextSibling*(z_2, z_3) ∧ Child*(z_3, y).

The modified data tree is as shown in Figure 6. It uses specially labeled auxiliary 
nodes inserted between each pair of adjacent siblings in the data tree of the proof 
of Theorem 5.2. □ 
Fig. 7. Encoding the selection of exactly one of the positive literals of a clause as a conjunctive
query over signature τ_15: (a) data tree, (b) query.

Theorem 5.7. Conjunctive queries over the signatures

τ_12 := ((Label_a)_{a∈Σ}, Child^+, NextSibling),
τ_13 := ((Label_a)_{a∈Σ}, Child^+, NextSibling^+),
τ_14 := ((Label_a)_{a∈Σ}, Child^+, NextSibling*)

are NP-complete w.r.t. query complexity.

Proof. The proofs are analogous to the proofs for the respective signatures
with Child* rather than Child^+, except that we modify the respective data trees as
follows: Each edge (u, w) is replaced by two edges (u, v), (v, w), where v is a new
node. Now, to make a Following-step between two nodes corresponding to original
tree nodes, we can use the relation

Following^α(x, y) := ∃z_1 ∃z_2 ∃z_3 Child^+(z_1, x) ∧ NextSibling^α(z_2, z_3) ∧ Child^+(z_3, y),

where α is "1" for τ_12, "+" for τ_13, and "*" for τ_14. □

Theorem 5.8. Conjunctive queries over the signatures

τ_15 := ((Label_a)_{a∈Σ}, Following, NextSibling),
τ_16 := ((Label_a)_{a∈Σ}, Following, NextSibling^+),
τ_17 := ((Label_a)_{a∈Σ}, Following, NextSibling*)

are NP-complete w.r.t. query complexity.
Proof. We first look at signature τ_15. Consider the data tree shown in
Figure 7 (a) and the query of Figure 7 (b).

As in the proof of Theorem 5.2, there is again one variable per label L_1 (L_2, L_3),
which we call v_1 (v_2, v_3). Again, at most one of the variables v_1, v_2, and v_3 can be
mapped to its topmost position. The query shown in Figure 7 (b) requires that precisely
the partial assignments

θ(v_1) := 1, θ(v_2) := 4, θ(v_3) := 7
θ(v_1) := 2, θ(v_2) := 3, θ(v_3) := 7
θ(v_1) := 2, θ(v_2) := 5, θ(v_3) := 6

can be extended to solutions of the query.

This provides us with an encoding for the selection of exactly one literal from a
given clause with three positive literals. The full reduction from 1-in-3 3SAT over
positive literals can be obtained analogously to the proof of Theorem 5.2.

The same reduction can be used to prove the corresponding result for the
signatures τ_16 and τ_17. □

6. EXPRESSIVENESS 

In this section, we study the expressive power of conjunctive queries over trees. The 
main result is that for each conjunctive query over trees, an equivalent acyclic posi- 
tive query (APQ) can be found. However, these APQs are in general exponentially 
larger. As we show in Section 7, this is necessarily so. 

We introduce a number of technical notions. In Section 2, query graphs were
introduced as directed (multi)graphs. Below, we will deal with two kinds of cycles
in query graphs: directed cycles, the standard notion of cycles in directed graphs,
and the more general undirected cycles, which are cycles in the undirected shadows
of query graphs.** The standard notion of conjunctive query acyclicity in the case
that relations are at most binary refers to the absence of undirected cycles from
the shadow of the query graph.

Let F ⊆ Ax be a set of axes. We denote by CQ[F] the conjunctive queries over
signature ((Label_a)_{a∈Σ}, F). By PQ[F] we denote the positive (first-order) queries
(written as finite unions of conjunctive queries) over F. We denote the acyclic
positive queries (that is, unions of acyclic conjunctive queries) over F by APQ[F].

Remark 6.1. Given a set of XPath axes F, let F^{−1} denote their inverses (e.g.,
Parent for Child; see [World Wide Web Consortium 1999] for the names of the
inverse XPath axes). It is easy to show that for any set F of XPath axes, positive
Core XPath[F ∪ F^{−1}], the positive, navigation-only fragment of the XPath language
[Gottlob et al. 2005], captures the unary APQ[F] on trees in which each node has
(at most) one label. No proof of this is presented here because a formal definition of
XPath is tedious and the result follows immediately from such a definition. (Positive
Core XPath queries are acyclic and support logical disjunction.)

Before we can get to the main result of this section, Theorem 6.6, we need to
define the notion of a join lifter, for which we will subsequently give an intuition
and an example. After providing two lemmata, we will be able to prove Theorem 6.6.
The proof of the main result employs a rewrite system whose workings
are illustrated in a detailed example in Figure 8 (Example 6.8). The reader may
find it helpful to start with that example before reading on sequentially from here.

**The shadow of a directed graph is obtained by replacing each directed edge from node u to node
v by an undirected edge between u and v.

Definition 6.2. Let F be a set of binary relations. A positive quantifier-free
formula ψ_{R,S}(x, y, z) in Disjunctive Normal Form (DNF) is called a join lifter over
F for binary relations R and S if

(1) each conjunction of ψ_{R,S}(x, y, z) is of one of the following five forms:

(a) P(x, y) ∧ P'(y, z)

(b) P(y, x) ∧ P'(x, z)

(c) P(x, z) ∧ y = z

(d) P(y, z) ∧ x = z

(e) P(x, z) ∧ x = y

where P, P' ∈ F, and

(2) for all trees A and nodes a, b, c,

(A, a, b, c) ⊨ φ_{R,S}[a, b, c]  ⟺  (A, a, b, c) ⊨ ψ_{R,S}[a, b, c],

where φ_{R,S}(x, y, z) = R(x, z) ∧ S(y, z).
(Subsequently, we will write this as ψ_{R,S} ≡ φ_{R,S}.)

A join lifter ψ_{R,S} can be used to rewrite a conjunctive query Q that contains
atoms R(x, z), S(y, z) (the role of such pairs of atoms will be clarified below, in
the proof of Lemma 6.5) into a union of conjunctive queries (one conjunctive query
for each conjunction C of the DNF formula ψ_{R,S}, by replacing R(x, z), S(y, z) by
C) such that none of the conjunctive queries obtained is larger than Q. In fact,
each of the conjunctive queries obtained is either shortened (because equality atoms
v = w in conjunctions of form (c), (d) or (e) can be eliminated after substituting
variable v by w everywhere in the query) or the join on z is intuitively lifted "up"
in the query graph using a conjunction of form (a) or (b).

Example 6.3. The formula

ψ_{Child,NextSibling}(x, y, z) = Child(x, y) ∧ NextSibling(y, z)

is a join lifter for Child and NextSibling because it satisfies the syntactic requirement
(1) (the formula is a single conjunction of form (a)) and the equivalence (2)

ψ_{Child,NextSibling}(x, y, z) ≡ φ_{Child,NextSibling}(x, y, z) = Child(x, z) ∧ NextSibling(y, z)

of Definition 6.2. Conjunctions of form (a) such as this one lift the join occurring
in φ_{Child,NextSibling} one level up in the query graph, here from variable z in
φ_{Child,NextSibling} to variable y in ψ_{Child,NextSibling}, when rewriting
φ_{Child,NextSibling} by ψ_{Child,NextSibling}.

Moving joins upward is only meaningful in queries whose query graphs do not
have directed cycles. As demonstrated by the following lemma, such cycles can
always be eliminated.

Lemma 6.4. Let Q be a CQ[Ax] that contains a directed cycle

R_1(x_1, x_2), R_2(x_2, x_3), ..., R_{k−1}(x_{k−1}, x_k), R_k(x_k, x_1).

If R_1, ..., R_k ∈ {Child*, NextSibling*}, then Q is equivalent to the query obtained
by adding x_1 = x_2 = ··· = x_k to the body of Q. Otherwise, Q is unsatisfiable.

Proof. The graph of the relation Child ∪ NextSibling ∪ Following is acyclic.
Therefore, a query with a cycle can only be satisfied if all variables in the cycle are
mapped to the same node. If the cycle contains an irreflexive axis (any axis besides
Child* and NextSibling*), the query is unsatisfiable. □
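
The case distinction of Lemma 6.4 is easy to phrase as a procedure. The following is a minimal sketch under our own assumptions: a cycle is given as a list of (axis, source, target) triples with axis names written as strings, and the function name resolve_directed_cycle is hypothetical.

```python
# The only reflexive axes; every other axis is irreflexive on trees.
REFLEXIVE_AXES = {"Child*", "NextSibling*"}


def resolve_directed_cycle(cycle):
    """Handle one directed cycle given as a list of (axis, source, target) atoms.

    Returns None if the query is unsatisfiable (some axis on the cycle is
    irreflexive); otherwise returns a substitution that maps every variable
    of the cycle to one representative, i.e. x_1 = x_2 = ... = x_k.
    """
    if any(axis not in REFLEXIVE_AXES for axis, _, _ in cycle):
        return None
    variables = [source for _, source, _ in cycle]
    representative = variables[0]
    return {var: representative for var in variables}
```

For instance, the cycle Child*(x, y), NextSibling*(y, x) collapses x and y into one variable, whereas Child*(x, y), Child(y, x) makes the query unsatisfiable.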

Lemma 6.5. Let F ⊆ Ax be a set of axes and let there be join lifters ψ_{R,S} over
F for each pair (R, S) of relations in F. Then, each CQ[F] can be rewritten into
an equivalent APQ[F] in singly exponential time.

Proof. Given a conjunctive query Q_0, we execute the following algorithm. Let
𝒬 be a set of conjunctive queries, initially {Q_0}. Repeat the following until the
query graphs of all queries in 𝒬 are forests.

(1) Choose any conjunctive query Q from 𝒬 whose query graph is not a forest.

(2) If Q contains a directed cycle in which a predicate other than NextSibling* or
Child* appears, Q is unsatisfiable (by Lemma 6.4) and is removed from 𝒬.

(3) For each directed cycle in Q that consists exclusively of Child* and NextSibling*
atoms, we identify the variables occurring in it. (That is, if x_1, ..., x_n are
precisely all the variables of the cycle, we replace each occurrence of any of these
variables in the body or head of Q by x_1.) Atoms of the form Child*(x_1, x_1) or
NextSibling*(x_1, x_1) are removed.

In order to assure safety, we add an atom Node(x_1) if x_1 now does not occur in
any remaining atom. (The predicate Node matches any node and can be defined
as R(x_1, x'_1), where R is a predicate of the directed cycle just eliminated, either
Child* or NextSibling*, and x'_1 is a new variable.)

By Lemma 6.4, the outcome of this transformation is equivalent to the input
query.

(4) Now there are no directed cycles left in the query graph, but undirected cycles
may remain. If Q contains undirected cycles, we choose a variable z that is
in an undirected cycle such that there is no directed path in the query graph
leading from z to another variable that is in an undirected cycle as well. (Such
a choice is possible because there are no directed cycles in the query graph.)
The cycle contains two atoms R(x, z), S(y, z).

Now, we use join lifter ψ_{R,S} to replace these two atoms. Let ψ_{R,S} be the DNF
ψ^1_{R,S} ∨ ··· ∨ ψ^k_{R,S} such that the ψ^i_{R,S} are conjunctions of atoms. We create
copies Q_1, ..., Q_k of Q and replace R(x, z), S(y, z) in each Q_i by ψ^i_{R,S}. If ψ^i_{R,S}
contains an equality atom v = w, we replace each occurrence of variable w in Q_i
by v and remove the equality atom. Finally, we replace Q in 𝒬 by Q_1, ..., Q_k.

First we show that this algorithm indeed terminates. The elimination of directed
cycles, steps (2) and (3), is straightforward, but we need to consider in more
detail how the algorithm deals with undirected cycles. The idea here is to eliminate
undirected cycles from the bottom to the top (with respect to the direction of edges
in the query graph). This is done by rewriting bottom atoms R(x, z), S(y, z) of
undirected cycles using the join lifters ψ_{R,S}. While R(x, z), S(y, z) are two binary
atoms that involve z, each conjunction in join lifter ψ_{R,S} contains only one binary
atom over z apart from a possible equality atom. Therefore, each rewrite step either
removes z from at least one cycle or identifies z with either x or y via an equality
atom (which, for our purposes, means to remove z entirely, and thus also from any
cycle it appears in).

Let |V| be the number of variables and |E| be the number of binary atoms in Q_0.
The number of atoms in a conjunctive query never increases by the rewrite steps
(each conjunction of the formulae ψ_{R,S} is of length two). For a given bottommost
variable z of the query graph that is in an undirected cycle, there can be at most
|E| incoming edges (i.e., binary atoms) for z. After at most |E| − 1 appropriate
iterations of our algorithm, there is only one incoming edge for z or z has been
eliminated. Consequently, after no more than |V| · |E| iterations of our algorithm
on a conjunctive query (in each of which a join lifter can be applied), the conjunctive
query is necessarily acyclic.

In each such loop, a single query may be replaced by at most k others, where k
is the maximum number of conjunctions occurring in a join lifter, a constant (no
greater than three in this article). Thus, we make no more than k^{|V|·|E|} iterations
in total until all conjunctive queries in 𝒬 are acyclic, i.e. their query graphs are
forests. This is the termination condition of our algorithm.

Thus, 𝒬 cannot contain more than k^{|V|·|E|} conjunctive queries, all of size ≤
|Q_0|. Since the cycle detection and transformation procedures in (2) to (4) can be
easily implemented to run in polynomial time each, the overall running time of our
algorithm is singly exponential.

The query computed by the algorithm is equivalent to Q_0. This follows by
induction from the fact that the steps (2) to (4) each produce equivalent rewritings. (The
individual arguments are provided with steps (2) to (4).) Thus, on termination, 𝒬
is a union of acyclic conjunctive queries, that is, an APQ equivalent to Q_0.

Note that step (4) can introduce new directed cycles into a query; therefore, it 
may be necessary to repeat steps (2) and (3) after an application of step (4), as 
done by our algorithm. □ 

Note that the rewriting technique of the previous algorithm is nondeterministic 
(by the choice of next query to rewrite in step (1)), but we do not prove confluence 
of our rewrite system since it is not essential to our main theorem, stated next. 

Theorem 6.6. (1) For F ⊆ {Child, Child*, Child^+}, CQ[F] ⊆ APQ[F].

(2) For F ⊆ {Child, Child^+, NextSibling, NextSibling*, NextSibling^+},

CQ[F] ⊆ APQ[F].

(3) For F ⊆ {Child, Child*, Child^+, NextSibling, NextSibling*, NextSibling^+},

CQ[F] ⊆ APQ[F ∪ {Child^+}].
Proof. Consider the DNF formulae ψ_{R,S}(x, y, z), defined by the following cases:

R(x, z) ∧ x = y
    if R = S ∈ {Child, NextSibling};

(R(x, z) ∧ R(y, x)) ∨ (R(x, y) ∧ R(y, z))
    if R = S ∈ {Child*, NextSibling*};

(R(x, z) ∧ R(y, x)) ∨ (R(x, y) ∧ R(y, z)) ∨ (R(x, z) ∧ x = y)
    if R = S ∈ {Child^+, NextSibling^+};

(R(x, z) ∧ y = z) ∨ (R(x, z) ∧ S(y, x))
    if R ∈ {Child, NextSibling}, S = R*;

(R(x, z) ∧ x = y) ∨ (R(x, z) ∧ S(y, x))
    if R ∈ {Child, NextSibling}, S = R^+;

(R(x, z) ∧ y = z) ∨ (R(x, z) ∧ S(y, x)) ∨ (R(y, z) ∧ S(x, y))
    if R = χ^+, S = χ*, where χ ∈ {Child, NextSibling};

R(x, z) ∧ S(y, x)
    if R ∈ {N, N*, N^+}, N = NextSibling, S ∈ {Child, Child^+};

(R(x, z) ∧ y = z) ∨ (R(x, z) ∧ Child^+(y, x))
    if R ∈ {N, N*, N^+}, N = NextSibling, S = Child*;

ψ_{S,R}(y, x, z)
    otherwise,

which are defined for all

R, S ∈ {Child, Child*, Child^+, NextSibling, NextSibling*, NextSibling^+}.

The ψ_{R,S} are join lifters for each R, S. The syntactic properties of join lifters of
Definition 6.2 can be easily verified by inspection. Moreover, indeed for all R, S,
ψ_{R,S}(x, y, z) ≡ φ_{R,S}(x, y, z) = R(x, z) ∧ S(y, z). The arguments required to show
this are very simple and are omitted. (For example, φ_{Child,Child} ≡ Child(x, z) ∧ x = y
because each node in a tree can have at most one parent.)

Thus, the ψ_{R,S} are indeed join lifters. Now observe that ψ_{R,Child*} for R ∈
{NextSibling, NextSibling^+, NextSibling*} uses the Child^+ axis, but all other ψ_{R,S}
only use the relations R and S (plus equality). From Lemma 6.5, it follows that for
F such that Child* ∉ F or NextSibling, NextSibling^+, NextSibling* ∉ F, each CQ[F]
can be translated into an equivalent APQ[F] (parts 1 and 2 of our theorem) and
otherwise, each CQ[F] can be translated into an equivalent APQ[F ∪ {Child^+}]
(part 3). □
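
The omitted equivalence arguments can be spot-checked by brute force on small trees. The sketch below is only a sanity check under our own assumptions, not a proof: it enumerates all ordered trees with at most five nodes and compares φ_{R,S} with ψ_{R,S} for four of the cases above. The tree encoding (a tree is the tuple of its child subtrees) and all function names are illustrative choices.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def forests(m):
    """All ordered forests with m nodes; a tree is the tuple of its child subtrees."""
    if m == 0:
        return ((),)
    found = []
    for first in range(1, m + 1):
        for tree in forests(first - 1):      # first tree = root + forest of first-1 nodes
            for rest in forests(m - first):
                found.append((tree,) + rest)
    return tuple(found)


def closure(rel, nodes, reflexive):
    """Transitive (and optionally reflexive) closure of a binary relation."""
    result, changed = set(rel), True
    while changed:
        changed = False
        for a, b in list(result):
            for c, d in rel:
                if b == c and (a, d) not in result:
                    result.add((a, d))
                    changed = True
    return result | {(v, v) for v in nodes} if reflexive else result


def axes(tree):
    """Number the nodes in document order and build the six axis relations."""
    child, nextsib, count = set(), set(), [0]

    def visit(subtree):
        me, previous = count[0], None
        count[0] += 1
        for sub in subtree:
            kid = visit(sub)
            child.add((me, kid))
            if previous is not None:
                nextsib.add((previous, kid))
            previous = kid
        return me

    visit(tree)
    nodes = range(count[0])
    return {"Child": child, "Child+": closure(child, nodes, False),
            "Child*": closure(child, nodes, True), "NextSibling": nextsib,
            "NextSibling+": closure(nextsib, nodes, False),
            "NextSibling*": closure(nextsib, nodes, True)}


# Four of the join lifters from the case distinction, written as predicates.
LIFTERS = {
    ("Child", "Child"): lambda A, x, y, z: (x, z) in A["Child"] and x == y,
    ("Child", "NextSibling"): lambda A, x, y, z:
        (x, y) in A["Child"] and (y, z) in A["NextSibling"],
    ("Child+", "Child+"): lambda A, x, y, z:
        ((x, z) in A["Child+"] and (y, x) in A["Child+"])
        or ((x, y) in A["Child+"] and (y, z) in A["Child+"])
        or ((x, z) in A["Child+"] and x == y),
    ("NextSibling*", "Child*"): lambda A, x, y, z:
        ((x, z) in A["NextSibling*"] and y == z)
        or ((x, z) in A["NextSibling*"] and (y, x) in A["Child+"]),
}


def lifter_holds(R, S, psi=None, max_nodes=5):
    """Check R(x,z) and S(y,z) <=> psi(x,y,z) on all ordered trees up to max_nodes nodes."""
    psi = psi or LIFTERS[(R, S)]
    for n in range(1, max_nodes + 1):
        for tree in forests(n - 1):          # a tree with n nodes
            A = axes(tree)
            for x, y, z in product(range(n), repeat=3):
                if ((x, z) in A[R] and (y, z) in A[S]) != psi(A, x, y, z):
                    return False
    return True
```

Calling lifter_holds(R, S) for each key of LIFTERS confirms the equivalence on these small trees; passing a deliberately wrong formula via psi shows that the check can fail.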

In all three cases of Theorem 6.6, the conjunctive queries can be rewritten into 
equivalent APQs in singly exponential time. 

Similar techniques to those of the previous two proofs were used in [Olteanu
et al. 2002] to eliminate backward axes from XPath expressions and in [Schwentick
2000] to rewrite first-order queries over trees given by certain regular path
relations. The special cases of Theorem 6.6 that CQ[{Child}] ⊆ APQ[{Child}] and
CQ[{Child, Child^+}] ⊆ APQ[{Child, Child^+}] are implicit in [Benedikt et al. 2003].



Example 6.7. Consider the query Q_0(x, y) ← Child*(x, y) ∧ NextSibling*(x, y).
Since ψ_{NextSibling*,Child*}(x, x, y) = (NextSibling*(x, y) ∧ x = y) ∨ (NextSibling*(x, y) ∧
Child^+(x, x)), we set 𝒬 = {Q, Q'} with Q(x, x) ← NextSibling*(x, x), which is
further simplified to Q(x, x) ← Node(x), and Q'(x, y) ← NextSibling*(x, y) ∧
Child^+(x, x). Q' is unsatisfiable due to the directed cycle defined by its second
atom and is removed from 𝒬. We obtain the APQ {Q} which is equivalent to Q_0.
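
The rewrite step of Example 6.7 can be reproduced with a few lines of code. This is a minimal sketch, assuming a query body is a list of (predicate, argument-tuple) atoms and a join lifter is a list of conjunctions over the placeholder variables "x", "y", "z" and placeholder predicates "R", "S"; equality atoms use the predicate name "=". The function name apply_join_lifter and this encoding are our own.

```python
def apply_join_lifter(body, r_atom, s_atom, disjuncts):
    """Replace atoms R(x,z), S(y,z) in body by each conjunction of a join lifter.

    Returns one rewritten body per conjunction. An equality atom v = w is
    eliminated by replacing w with v everywhere in that copy.
    """
    (r_pred, (x, z)), (s_pred, (y, z_again)) = r_atom, s_atom
    assert z == z_again, "both atoms must share their second variable"
    actual = {"x": x, "y": y, "z": z}          # placeholder -> query variable
    preds = {"R": r_pred, "S": s_pred}         # placeholder -> axis name
    rest = [atom for atom in body if atom not in (r_atom, s_atom)]
    copies = []
    for conjunction in disjuncts:
        new_body, rename = list(rest), {}
        for pred, args in conjunction:
            args = tuple(actual[a] for a in args)
            if pred == "=":
                v, w = args
                if v != w:
                    rename[w] = v
            else:
                new_body.append((preds.get(pred, pred), args))
        copies.append([(p, tuple(rename.get(a, a) for a in args))
                       for p, args in new_body])
    return copies


# Join lifter for R in {N, N*, N+}, S = Child* from the proof of Theorem 6.6.
PSI_NEXTSIBLING_CHILDSTAR = [
    [("R", ("x", "z")), ("=", ("y", "z"))],
    [("R", ("x", "z")), ("Child+", ("y", "x"))],
]

Q0 = [("NextSibling*", ("x", "y")), ("Child*", ("x", "y"))]
REWRITTEN = apply_join_lifter(Q0, Q0[0], Q0[1], PSI_NEXTSIBLING_CHILDSTAR)
```

Here REWRITTEN holds the bodies of Q and Q' before simplification: NextSibling*(x, x) and NextSibling*(x, y) ∧ Child^+(x, x). Removing the unsatisfiable copy and simplifying the reflexive atom to Node(x) is not done by this sketch.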

Example 6.8. Figure 8 illustrates the query rewriting algorithm of the proof of 
Lemma 6.5 using the join lifters of the proof of Theorem 6.6 by means of an example. 
The example query Q is that from the introduction, but since Theorem 6.6 does
not handle the Following axis, we first rewrite it using Child* and NextSibling^+.
All conjunctive queries that we obtain are unsatisfiable, except for one, shown at 
the bottom left corner of Figure 8. Thus, for Q there exists an equivalent acyclic 
conjunctive query. 

Note that in Figure 8 we make an exception from the conventions followed 

throughout this article by labeling the nodes of the query graphs with the vari- 
able names in order to allow for the variables to be tracked through the rewrite 
steps more easily. □ 

We complement Theorem 6.6 by two further translation theorems. 

Theorem 6.9. If Q is a CQ[F] such that

F ⊆ {Child, NextSibling, NextSibling*, NextSibling^+, Following},

then Q can be rewritten into an equivalent APQ[F ∪ {NextSibling^+}] in singly
exponential time.

Proof. We extend ψ_{R,S} from the proof of Theorem 6.6 by join lifter formulae
for S = Following and R ∈ F:

ψ_{NextSibling,Following}(x, y, z) := (NextSibling(x, z) ∧ x = y) ∨
                                      (NextSibling(x, z) ∧ Following(y, x))

ψ_{NextSibling^+,Following}(x, y, z) := (NextSibling^+(x, z) ∧ x = y) ∨
                                        (NextSibling^+(x, z) ∧ Following(y, x)) ∨
                                        (NextSibling^+(x, y) ∧ NextSibling^+(y, z))

ψ_{NextSibling*,Following}(x, y, z) := (NextSibling*(x, z) ∧ Following(y, x)) ∨
                                       (NextSibling*(x, y) ∧ NextSibling^+(y, z))

ψ_{Child,Following}(x, y, z) := (Child(x, z) ∧ x = y) ∨
                                (Child(x, z) ∧ Following(y, x)) ∨
                                (Child(x, y) ∧ NextSibling^+(y, z))

ψ_{Following,Following}(x, y, z) := (Following(x, z) ∧ x = y) ∨
                                    (Following(x, z) ∧ Following(y, x)) ∨
                                    (Following(x, y) ∧ Following(y, z))

Each ψ_{R,S} is defined using only relations R, S, = and NextSibling^+. Now the
theorem follows immediately from Lemma 6.5. □
Fig. 8. Translation of a conjunctive query into an APQ.

Theorem 6.10. If Q is a CQ[F] such that F ⊆ Ax, then Q can be rewritten
into an equivalent APQ[F ∪ {Child^+, NextSibling^+}] in singly exponential time.

Proof. Given query Q, we first rewrite all occurrences of Following using Child*
and NextSibling^+ using Equation (1) from Section 2. In order to be economical
with axes, we rewrite all n occurrences of Child* using Child^+. We define an APQ
consisting of 2^n copies of Q such that in the m-th copy of Q, the k-th Child*(x, y)
atom is replaced by Child^+(x, y) if the k-th bit of m, represented in binary, is,
say, 1 and by x = y otherwise (that is, all occurrences of variable y in the query
are replaced by x). Clearly, since Child*(x, y) ⇔ Child^+(x, y) ∨ x = y, the APQ
obtained in this way is equivalent to Q. Then we apply the algorithm of the proof
of Lemma 6.5 using the join lifters as in the proof of Theorem 6.6 (3) to each of the
2^n modified conjunctive queries and compute the union of the APQs obtained. Of
course, the overall transformation can be again effected in exponential time. □
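
The first step of this proof, the expansion of n Child* atoms into 2^n copies, can be sketched as follows. This is an illustration under our own assumptions: a Boolean query body is a list of (predicate, argument-tuple) atoms, the function name expand_child_star is hypothetical, and safety (a variable losing all its atoms) as well as head variables are not handled.

```python
def expand_child_star(atoms):
    """Return the 2^n bodies obtained from a body with n Child* atoms.

    In copy m, the k-th Child*(x, y) atom becomes Child+(x, y) if bit k of m
    is 1; otherwise x and y are identified (y is replaced by x everywhere).
    """
    n = sum(1 for pred, _ in atoms if pred == "Child*")
    copies = []
    for m in range(2 ** n):
        merged = {}                      # variable -> variable it was identified with

        def find(var):
            while var in merged:
                var = merged[var]
            return var

        kept, k = [], 0
        for pred, args in atoms:
            if pred != "Child*":
                kept.append((pred, args))
                continue
            if (m >> k) & 1:
                kept.append(("Child+", args))
            else:
                x, y = find(args[0]), find(args[1])
                if x != y:
                    merged[y] = x
            k += 1
        # Apply the collected identifications to every remaining atom.
        copies.append([(pred, tuple(find(a) for a in args)) for pred, args in kept])
    return copies
```

For the body A(x) ∧ Child*(x, y) ∧ B(y) this yields the two bodies A(x) ∧ B(x) and A(x) ∧ Child^+(x, y) ∧ B(y), whose union is equivalent to the original query.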

It follows that the acyclic positive queries capture the positive queries over trees. 

Corollary 6.11. PQ[Ax] = APQ[Ax].

Remark 6.12. Since Child^+ and NextSibling^+ are XPath axes ("descendant"
and "following-sibling"), it follows from Theorem 6.10 that each unary conjunctive
query over XPath axes can also be formulated as an XPath query. This is in contrast
to full first-order logic (i.e., with negation) on trees, which is known to be stronger
than acyclic first-order logic on trees resp. Core XPath [Marx 2005].

Obviously, the CQ[F] are not closed under union. On trees of one node only, 
conjunctive queries are equivalent to ones which do not use binary atoms. It is easy 
to see that the query {x \ A{x) V B{x)} has no conjunctive counterpart. 

Proposition 6.13. For any F ⊆ Ax, CQ[F] ≠ APQ[F].

There are signatures with axes for which all conjunctive queries can be rewritten
into APQs in polynomial time.***

Proposition 6.14 [Gottlob and Koch 2004]. Any CQ[{Child, NextSibling}] 
can be rewritten into an equivalent acyclic CQ[{ Child, NextSibling, NextSibling*}] in 
linear time. 

Remark 6.15. It is easy to verify by inspecting the proof in [Gottlob and 
Koch 2004] that rewriting each CQ[Child, NextSibling] into an equivalent acyclic 
CQ[Child, NextSibling] in linear time is also possible. (The proof there also deals 
with relations such as FirstChild. If these are not present, NextSibling* is not 
required.) 

7. SUCCINCTNESS 

The translations from conjunctive queries into APQs of the Theorems 6.6, 6.9 and 
6.10 run in exponential time and can produce APQs of exponential size. In this 
section, we show that this situation cannot be improved upon: there are conjunctive 
queries over trees that cannot be polynomially translated into equivalent APQs. 



***As shown in the next section, there are also signatures for which this is not possible.

By the size |Q| of a Boolean conjunctive query Q, we denote the number of atoms
in its body. The size of an APQ is given by the sum of the sizes of the constituent 
conjunctive queries. 

Let D_n denote the n-diamond Boolean conjunctive query

D_n ← Y_1(y_1) ∧ ⋀_{i=1}^{n} ( Child^+(y_i, x_i) ∧ X_i(x_i) ∧ Child^+(x_i, y_{i+1}) ∧
                               Child^+(y_i, x'_i) ∧ X'_i(x'_i) ∧ Child^+(x'_i, y_{i+1}) ∧ Y_{i+1}(y_{i+1}) ).

A graphical representation of D_n is provided in Figure 9 (a).
The following is the main result of this section: 

Theorem 7.1. There is no family (Q_n)_{n≥1} of queries in APQ[Ax] such that
each Q_n is of size polynomial in n and is equivalent to D_n.

Before we can show this, we have to provide a few definitions. 

We use the acronyms ABCQ for acyclic Boolean conjunctive queries and DABCQ
(directed ABCQ) for Boolean conjunctive queries whose query graphs are acyclic.
That is, the query graph of a DABCQ is a directed acyclic graph, while the query
graph of an ABCQ is a forest (because conjunctive query acyclicity is defined with
respect to the undirected shadows of query graphs). By Lemma 6.4, an equivalent
DABCQ[F] exists (and can be computed efficiently) for each Boolean CQ[F] that
is satisfiable, for any F ⊆ Ax. The queries D_n are DABCQ[{Child^+}].

For a DABCQ Q, let Π_Q ⊆ Var(Q)* denote the set of variable-paths in the
query graph of Q from variables that have in-degree zero to variables that have
out-degree zero. For example, if Q is the left- and bottommost query of Figure 8,
then Π_Q = {xuy, xuvz}. We say that a label L occurs in variable-path π ∈ Π_Q
iff there is a variable x in π for which Q contains a unary atom L(x).
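
To see why the diamond queries are hard to flatten, it helps to count their variable-paths. The following sketch builds the binary atoms of D_n as a directed graph and enumerates Π_{D_n}; the string names for variables (for example "x1'" for x'_1) and the function names are our own illustrative choices.

```python
def diamond_edges(n):
    """Edges of the query graph of D_n: y_i -> x_i -> y_{i+1} and y_i -> x'_i -> y_{i+1}."""
    edges = []
    for i in range(1, n + 1):
        for middle in (f"x{i}", f"x{i}'"):
            edges.append((f"y{i}", middle))
            edges.append((middle, f"y{i + 1}"))
    return edges


def variable_paths(edges):
    """All paths from variables of in-degree zero to variables of out-degree zero."""
    successors, nodes, targets = {}, set(), set()
    for source, target in edges:
        successors.setdefault(source, []).append(target)
        nodes.update((source, target))
        targets.add(target)
    paths = []

    def extend(path):
        last = path[-1]
        if last not in successors:           # out-degree zero: the path is complete
            paths.append(tuple(path))
            return
        for nxt in successors[last]:
            extend(path + [nxt])

    for start in sorted(nodes - targets):    # in-degree zero
        extend([start])
    return paths
```

D_n has only 7n + 1 atoms (counting Y_1(y_1)), but variable_paths(diamond_edges(n)) returns 2^n paths, one for each way of choosing x_i or x'_i in every diamond.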

By a path-structure, we denote a tree structure in which the graph of the Child-
relation is a path. Given a variable-path x_1 ··· x_k, the associated label-path is the
path-structure of k nodes in which the i-th node is labeled L iff Q contains atom
L(x_i). Observe that some nodes of this structure may be unlabeled, and some
may have several labels. Given a set P of variable-paths, let LP(P) denote the
corresponding label-paths.

We say that a path-structure is k-scattered if (*) it consists of at least k nodes,
(*) each node has at most one label, (*) no two nodes have the same label, and (*) if
node v has a label and node v' (v ≠ v') either is the topmost node, the bottommost
node, or has a label, then the distance between v and v' is at least k.

In order to prove our theorem, we need two technical lemmata. The first states,
essentially, that on sufficiently scattered path structures, each ABCQ is equivalent
to an ABCQ that only uses the axes Child^+ and Child*. This is somewhat
reminiscent of results on the locality of first-order queries (cf. e.g. [Libkin 2004]).

Lemma 7.2. Let Q be an ABCQ[Ax] that is true on at least one |Q|-scattered
path-structure. Then, there is an ABCQ[{Child^+, Child*}] Q' such that Q' ⊆ Q,
|Q'| ≤ |Q|, and Q' is true on all |Q|-scattered path-structures on which Q is true.

The second lemma states that two DABCQ[{Child*, Child^+}] Q and Q' with
LP(Π_Q) ≠ LP(Π_{Q'}) differ in the sets of path structures on which they are true.

Fig. 9. Query D_n (a) and path structures PS(n, p(n)) (b).

Lemma 7.3. Let Q and Q' be two DABCQ[{Child*, Child^+}] and T be a set of
labels. If there is a label-path in LP(Π_{Q'}) in which all labels from T occur, but
there is no label-path in LP(Π_Q) in which all labels from T occur, then there is a
path-structure M on which Q is true but Q' is not.

Now we can prove our theorem. The proofs of the two lemmata follow at the end 
of the section. 

Proof of Theorem 7.1. By contradiction. Assume there is a (Boolean) APQ
$\mathcal{Q}$, that is, a finite union of ABCQs, which is equivalent to $D_n$, and that $\mathcal{Q}$ is of
size bounded by polynomial p(n). Let s be a path of p(n) unlabeled nodes. The
regular expression

$$s.Y_1.s.(X_1.s.X'_1 \mid X'_1.s.X_1).s.Y_2.s.(X_2.s.X'_2 \mid X'_2.s.X_2).s.Y_3.s.\cdots$$

$$\cdots.s.Y_n.s.(X_n.s.X'_n \mid X'_n.s.X_n).s.Y_{n+1}.s$$

defines a set of p(n)-scattered path-structures over alphabet

$$\Sigma = \{X_1, \ldots, X_n, X'_1, \ldots, X'_n, Y_1, \ldots, Y_{n+1}\},$$

as sketched in Figure 9 (b). We refer to the set of these structures as PS(n, p(n)).
It is easy to see that $D_n$ is true on each of the structures in PS(n, p(n)).
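
To make the shape of these structures concrete, the following sketch enumerates the words of the regular expression. It assumes a path structure is written as a list of labels from top to bottom, with `None` for an unlabeled node, and uses a parameter `gap` in place of p(n); the function name and this encoding are illustrative only.

```python
from itertools import product
from typing import Iterator, List, Optional


def ps_structures(n: int, gap: int) -> Iterator[List[Optional[str]]]:
    """Enumerate s.Y1.s.(X1.s.X1'|X1'.s.X1).s. ... .s.Y(n+1).s

    s is a run of `gap` unlabeled nodes (None); `gap` stands in for p(n).
    """
    s: List[Optional[str]] = [None] * gap
    # One Boolean per index i: is the pair (X_i, X_i') swapped?
    for swaps in product([False, True], repeat=n):
        path = list(s)
        for i, swapped in enumerate(swaps, start=1):
            first, second = (f"X{i}'", f"X{i}") if swapped else (f"X{i}", f"X{i}'")
            path += [f"Y{i}"] + s + [first] + s + [second] + s
        path += [f"Y{n + 1}"] + s
        yield path
```

Each of the n pairs is ordered independently, which is why the set contains exactly $2^n$ pairwise distinct structures.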

There are $2^n$ structures in PS(n, p(n)) and $D_n$ is true on all of them, but there
are no more than p(n) ABCQs in $\mathcal{Q}$. Therefore, there is an ABCQ $Q \in \mathcal{Q}$ which is
true for at least $2^{n - \log p(n)}$ structures in PS(n, p(n)) and is not true on any structure
on which $D_n$ is not true.

As the path structures in PS(n, p(n)) are ($p(n) \ge |Q|$)-scattered, by Lemma 7.2,
there is an ABCQ[{$Child^+$, $Child^*$}] Q' with $|Q'| \le |Q|$, $Q' \subseteq Q$, and which is true
on all structures in PS(n, p(n)) on which Q is true.

In each path-structure A of PS(n, p(n)), for any $1 \le j \le n$, precisely one node
v is labeled $X_j$ and precisely one different node w is labeled $X'_j$. Thus if query Q'
contains unary atoms $X_j(x_j)$ and $X'_j(x'_j)$, a mapping $\theta$ can only be a satisfaction
of Q' on A if $\theta(x_j) = v$ and $\theta(x'_j) = w$. But if there is a variable-path in $\Pi_{Q'}$ with
$x_j$ above $x'_j$ (respectively, $x'_j$ above $x_j$) and w above v (respectively, v above w) in
A, no satisfaction of Q' on the structure can exist. Let $i_1, \ldots, i_m$ be precisely those
pairwise distinct indexes for which, for $1 \le j \le m$, Q' does not contain a variable-path
containing two variables $x_{i_j}$ and $x'_{i_j}$ such that $X_{i_j}(x_{i_j})$ and $X'_{i_j}(x'_{i_j})$ are unary
atoms of Q'. Then Q' is true on at most $2^m$ path-structures of PS(n, p(n)).

We assumed that Q is true on at least $2^{n - \log p(n)}$ structures of PS(n, p(n)) and
showed that Q' is true on the same. But then $m \ge n - \log p(n)$.

Since the (undirected) query graph of Q' is a forest, the number of paths in
$\Pi_{Q'}$ is not greater than the square of the number of its variables. As $|Q'| \le p(n)$,
$|\Pi_{Q'}| \le p(n)^2$.

Now, if $n > 3 \cdot \log p(n)$, then $m > 2 \cdot \log p(n)$ and there are more choices

$$\Gamma = \{E_1 \in \{X_{i_1}, X'_{i_1}\}, \ldots, E_m \in \{X_{i_m}, X'_{i_m}\}\}$$

than there are paths in $\Pi_{Q'}$. Assume there are two distinct such choices $\Gamma, \Gamma'$ and
a variable-path $\pi \in \Pi_{Q'}$ such that all labels of $\Gamma \cup \Gamma'$ occur in $\pi$. Then there is an
index $i_j$ such that $X_{i_j}, X'_{i_j} \in \Gamma \cup \Gamma'$. This is in contradiction to the assumptions we
made about the indexes $i_1, \ldots, i_m$. Thus there must be (at least) one such choice
$\Gamma$ such that no single path in $\Pi_{Q'}$ exists in which all the labels of $\Gamma$ occur. Since
there is a path in $D_n$ which contains all the labels of $\Gamma$, by Lemma 7.3, there is a
model M of Q' which is not a model of $D_n$. Since $Q' \subseteq Q$, M is also a model of Q.

This is in contradiction with our assumption that $Q \subseteq D_n$. Consequently, for
$n > 3 \cdot \log p(n)$, there cannot be an ABCQ of size bounded by polynomial p(n) that
is contained in $D_n$ and is true on exponentially many structures of PS(n, p(n)). It
follows that for sufficiently large n there cannot be an APQ equivalent to $D_n$ that
is of polynomial size. □
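
The counting steps of the proof above can be summarized as a short chain of implications. This is only a restatement of the argument, assuming (as the exponent $n - \log p(n)$ suggests) that logarithms are taken to base 2.

```latex
\begin{align*}
|PS(n,p(n))| = 2^n,\ |\mathcal{Q}| \le p(n)
  &\;\Longrightarrow\; \text{some } Q \in \mathcal{Q} \text{ is true on at least }
     \frac{2^n}{p(n)} = 2^{\,n-\log p(n)} \text{ structures},\\
Q' \text{ is true on at most } 2^m \text{ structures}
  &\;\Longrightarrow\; m \ge n - \log p(n),\\
n > 3\log p(n)
  &\;\Longrightarrow\; m > 2\log p(n)
  \;\Longrightarrow\; 2^m > p(n)^2 \ge |\Pi_{Q'}|.
\end{align*}
```

The last line is the pigeonhole step: there are $2^m$ choices $\Gamma$ but at most $p(n)^2$ variable-paths, and no path can serve two distinct choices.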

Now it remains to prove the two technical lemmata we used in the proof of our 
succinctness result. 

Proof of Lemma 7.2 

We say that a Boolean query Q' is a faithful simplification of a Boolean query Q
w.r.t. a class of structures A if $|Q'| \le |Q|$, $Q' \subseteq Q$, and Q' is true on all structures of
A on which Q is true.

Below, by $G_Q$, we refer to the directed graph obtained from the query graph of
Q by removing all edges besides the Child edges. We will in particular consider
the connected components of this graph, subsequently called the $G_Q$-components.
We say that $C_0$ is a parent component of connected component C of such a graph
iff there is a variable x in $C_0$ and a variable y in C such that there is an atom
$Child^+(x,y)$ or $Child^*(x,y)$ in Q. (Of course, $C \neq C_0$ because the query graph of
Q is a forest.) The ancestors of a component are obtained by upward reachability
through the parent relation on $G_Q$-components.

Fig. 10. A query (a) and the query embedded into a path structure with B as high above A as
possible (b). The unlabeled edges are child-edges.

Lemma 7.2 immediately follows from the following four lemmata. 

Lemma 7.4. Let Q be an ABCQ[Ax] that is true on at least one path structure.
Then there is an ABCQ[{$Child$, $Child^*$, $Child^+$}] Q' that is a faithful simplification
of Q w.r.t. the path-structures and in which each $G_{Q'}$-component is a path.

Proof. Query Q cannot contain a $NextSibling$, $NextSibling^+$, or $Following$-atom,
because if it does, Q is false on all path-structures, contradicting our assumption
that Q is true on at least one path-structure.

Let Q' be the query obtained from Q by iteratively applying the following three
rules until a fixpoint is reached.

- if Q' contains atom $NextSibling^*(x,y)$, remove it and substitute all occurrences
  of variable y in Q' by x;
- if there are atoms $Child(x,z)$, $Child(y,z)$ in Q', remove $Child(y,z)$ and substitute
  every occurrence of y in Q' by x;
- if there are atoms $Child(x,y)$, $Child(x,z)$ in Q', remove $Child(x,z)$ and substitute
  every occurrence of z in Q' by y.
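
The three rules can be run to a fixpoint as sketched below. The sketch assumes a query is a set of tuples, `(axis, x, y)` for binary atoms and `(label, x)` for unary atoms, and that no label is named like an axis; this encoding and the function names are hypothetical and serve only to illustrate the rewriting.

```python
from typing import Set, Tuple

Atom = Tuple[str, ...]  # ("Child", x, y) for binary atoms, (label, x) for unary ones


def substitute(atoms: Set[Atom], old: str, new: str) -> Set[Atom]:
    """Replace every occurrence of variable `old` by `new`."""
    return {(a[0],) + tuple(new if v == old else v for v in a[1:]) for a in atoms}


def find_rewrite(atoms: Set[Atom]):
    """Return (atom to delete, variable to replace, replacement) or None."""
    for a in sorted(atoms):
        if len(a) != 3:
            continue
        if a[0] == "NextSibling*":          # rule 1: merge y into x
            return a, a[2], a[1]
        if a[0] != "Child":
            continue
        for b in sorted(atoms):
            if len(b) != 3 or b[0] != "Child" or b == a:
                continue
            if a[2] == b[2]:                # rule 2: two parents of one node
                return b, b[1], a[1]
            if a[1] == b[1]:                # rule 3: two children of one node
                return b, b[2], a[2]
    return None


def simplify_on_paths(query: Set[Atom]) -> Set[Atom]:
    atoms = set(query)
    step = find_rewrite(atoms)
    while step is not None:                 # every step deletes at least one atom
        doomed, old, new = step
        atoms = substitute(atoms - {doomed}, old, new)
        step = find_rewrite(atoms)
    return atoms
```

Since each rewrite removes an atom and substitution can only merge atoms, the loop terminates after at most as many steps as the query has atoms.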

It is easy to verify that Q' is a faithful simplification of Q w.r.t. the path structures.
Moreover, none of the rewrite rules can introduce a cycle into Q', thus
it is an ABCQ[{$Child$, $Child^*$, $Child^+$}]. There are neither atoms $Child(x,y)$ and
$Child(x,z)$ with $y \neq z$ nor atoms $Child(x,z)$ and $Child(y,z)$ with $x \neq y$ in Q', so
each $G_{Q'}$-component is a path. □

Since each $G_Q$-component is a path, we may give the k variables inside $G_Q$-component
C the names $x^C_1, \ldots, x^C_k$. We use |C| to denote the number of variables
k in $G_Q$-component C. We will think of the node names of a path structure as an
initial segment of the integers, thus v > w if and only if v is below w in the path.

Lemma 7.5. Let Q be an ABCQ[{$Child$, $Child^*$, $Child^+$}] in which each $G_Q$-component
is a path and that is true on at least one |Q|-scattered path structure
A. Then,

(a) any $G_Q$-component contains at most one label atom and

(b) if $C_1, \ldots, C_m$ is a path of $G_Q$-components and Q contains unary atoms $L(x^{C_1}_k)$
and $L'(x^{C_m}_l)$ with $L \neq L'$, then the node labeled L in A is above the node labeled
L'.

(c) if $C_1, \ldots, C_m$ is a path of $G_Q$-components and Q contains unary atoms $L(x^{C_1}_k)$
and $L(x^{C_m}_l)$, then for no $1 < j < m$ and $L' \neq L$ there can be a unary atom
with label L' over a variable of $C_j$ in Q.

Proof. (a) Assume that there is a $G_Q$-component C with two label atoms,

$$Child(x^C_1, x^C_2), \ldots, Child(x^C_{|C|-1}, x^C_{|C|}),\ L(x^C_k),\ L'(x^C_l)$$

with either $L \neq L'$ or $k \neq l$, a |Q|-scattered path-structure A, and a satisfaction
$\theta$ of Q on A. Since $|C| \le |Q| - 2$, $|\theta(x^C_k) - \theta(x^C_l)| \le |Q| - 2$. However, A is a
|Q|-scattered path-structure and thus cannot contain two labels on a subpath of
length |Q|. Contradiction.

(b) Let $\theta$ be a satisfaction of Q on A. Assume that $\theta(x^{C_1}_k) > \theta(x^{C_m}_l)$, i.e., the
node labeled L is below the node labeled L' in A. Then, for each $1 \le i < m$,
there is an atom R(x, y) in Q, with x a variable of $C_i$, y a variable of $C_{i+1}$, and R either
$Child^+$ or $Child^*$. So $\theta(x) \le \theta(y)$ and consequently
$\theta(x^{C_i}_1) \le \theta(x^{C_{i+1}}_{|C_{i+1}|}) = \theta(x^{C_{i+1}}_1) + |C_{i+1}| - 1$. But then

$$\theta(x^{C_1}_k) - \theta(x^{C_m}_l) \le \sum_{i=1}^{m} (|C_i| - 1) < |Q|.$$

(See Figure 10 for an illustration.) This is in contradiction with our assumption that
A is a |Q|-scattered path structure and thus $|\theta(x^{C_1}_k) - \theta(x^{C_m}_l)| \ge |Q|$.

(c) follows immediately from (b) and the fact that in a |Q|-scattered path structure
each label occurs at at most one node. □

Below, we will call the $G_Q$-components of an ABCQ[{$Child$, $Child^*$, $Child^+$}] Q
successor-repellent if for any two atoms R(x, y), R'(x', y') in Q with $x = x', y \neq y'$ or
$x \neq x', y = y'$, neither R = Child nor R' = Child. The naming of this term is due to
the following fact: Let Q be a successor-repellent ABCQ[{$Child$, $Child^*$, $Child^+$}].
Then for any two components C, C' such that C' is a successor of C and for any
satisfaction $\theta$ of Q (on a path structure), $\theta(x^C_{|C|}) < \theta(x^{C'}_1)$.

Lemma 7.6. Let Q be an ABCQ[{$Child$, $Child^*$, $Child^+$}] that is true on at least
one |Q|-scattered path structure and in which each $G_Q$-component is a path. Then
there is an ABCQ[{$Child$, $Child^*$, $Child^+$}] Q' that is a faithful simplification of Q
w.r.t. the |Q|-scattered path-structures and whose $G_{Q'}$-components are successor-repellent.

Proof. We construct the query Q' as follows. Initially, let Q' := Q. As often
as possible, for any path of $G_Q$-components $C_1, \ldots, C_m$ such that there are atoms

$$R_1(x^{C_1}_{k_1}, x^{C_2}_{l_2}), \ldots, R_{m-1}(x^{C_{m-1}}_{k_{m-1}}, x^{C_m}_{l_m}),\ L(x^{C_1}_k),\ L(x^{C_m}_l)$$

in Q' with $R_1, \ldots, R_{m-1} \in \{Child^+, Child^*\}$, but there is no label atom over components
$C_2, \ldots, C_{m-1}$, replace all occurrences of variable $x^{C_m}_l$ in Q' by $x^{C_1}_k$ and delete
atom $R_{m-1}(x^{C_{m-1}}_{k_{m-1}}, x^{C_m}_{l_m})$. Moreover, for each $1 \le i < m - 1$, if $R_i = Child^*$,
remove the atom $Child^*(x^{C_i}_{k_i}, x^{C_{i+1}}_{l_{i+1}})$ and substitute $x^{C_{i+1}}_{l_{i+1}}$ by $x^{C_i}_{k_i}$, and if $R_i = Child^+$,
replace $R_i$ by Child. Note that this query is an ABCQ. Then apply the algorithm of
Lemma 7.4 to turn the $G_{Q'}$-components into paths. (See Figure 11 for an example
of this construction.) To conclude with our construction, we replace each atom
$R(x^{C_i}_k, x^{C_j}_l)$ of Q', where R is either $Child^+$ or $Child^*$, by $R(x^{C_i}_{|C_i|}, x^{C_j}_1)$.

Fig. 11. A path of components with two occurrences of the same label (a), the same query after
atom replacement and variable substitution (b), and (c) the query after applying the algorithm
of Lemma 7.4. The unlabeled edges are child-edges.

Clearly, Q' is a successor-repellent ABCQ with $|Q'| \le |Q|$. Since Q is true on at
least one |Q|-scattered path structure, it follows from Lemma 7.5 (a) and (c) that
there are no two $G_{Q'}$-components $C_1, C_m$ such that both are labeled L and $C_1$ is
an ancestor of $C_m$.

It is also easy to verify that $Q' \subseteq Q$: Given an arbitrary tree structure A
and a satisfaction $\theta'$ for Q' on A, we can construct a satisfaction $\theta$ of Q as
$x \mapsto \theta'(y)$ if the construction of Q' from Q substituted x by y and $x \mapsto \theta'(x)$
otherwise. That $\theta$ is a satisfaction is obvious for all atoms of Q apart from atom
$R_{m-1}(x^{C_{m-1}}_{k_{m-1}}, x^{C_m}_{l_m})$, which we simply deleted. But it is not hard to convince
oneself that if $R_{m-1}(\theta(x^{C_{m-1}}_{k_{m-1}}), \theta(x^{C_m}_{l_m}))$ does not hold, then it must be true that
for every satisfaction $\theta_0$ of Q on A, $Child^*(\theta(x^{C_{m-1}}_{k_{m-1}}), \theta_0(x^{C_{m-1}}_{k_{m-1}}))$ and therefore
$Child^+(\theta(x^{C_m}_{l_m}), \theta_0(x^{C_m}_{l_m}))$. But then, $\theta_0(x^{C_1}_k) \neq \theta_0(x^{C_m}_l)$ and A must have at
least two distinct nodes labeled L. This is in contradiction with our assumption
that Q is true on at least one |Q|-scattered path structure.

Moreover, Q' is true on all |Q|-scattered path-structures on which Q is true. To
show this, let Q be true on some |Q|-scattered path-structure A with satisfaction
$\theta$. We construct a satisfaction $\theta'$ for Q' on A using the following algorithm.

1 for each $G_{Q'}$-component $C_j$ do
    // process components according to some topological ordering w.r.t.
    // $Child^+$, $Child^*$: if there is an atom $Child^+(x^{C_i}_{|C_i|}, x^{C_j}_1)$ or $Child^*(x^{C_i}_{|C_i|}, x^{C_j}_1)$
    // in Q', then $\theta'(x^{C_i}_1), \ldots, \theta'(x^{C_i}_{|C_i|})$ has been computed before.
2 begin
3   if Q' contains a label atom $L(x^{C_j}_l)$ then
4     $\theta'(x^{C_j}_l)$ := the unique node of the path structure which has label L;
5   else $\theta'(x^{C_j}_1)$ := 1 + max{$\theta'(x^{C_i}_{|C_i|})$ | $C_i$ is a parent $G_{Q'}$-component of $C_j$};
      // let max(∅) =
6   for the remaining $1 \le k \le |C_j|$ do   // (taking l = 1 if line 5 was executed)
7     $\theta'(x^{C_j}_k) := \theta'(x^{C_j}_l) + k - l$;
8 end;
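
The algorithm can be sketched in executable form as follows. The sketch rests on several assumptions that are not fixed by the text: components are given in topological order as triples (size, parent indices, optional label atom), the variable $x^{C_j}_k$ is represented by the pair `(j, k)`, the remaining variables of a component are placed relative to the variable assigned in line 4 or 5, and the maximum over an empty set of parents is taken to be 0. The names `Component` and `build_satisfaction` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (size |C_j|, indices of parent components, (l, L) if L(x_l) is the label atom)
Component = Tuple[int, List[int], Optional[Tuple[int, str]]]


def build_satisfaction(
    components: List[Component],
    node_of_label: Dict[str, int],
) -> Dict[Tuple[int, int], int]:
    """Assign a node to every variable (j, k), i.e. the k-th variable of C_j.

    `components` is assumed to be topologically sorted w.r.t. the parent
    relation; `node_of_label` maps each label to its unique node.
    """
    theta: Dict[Tuple[int, int], int] = {}
    for j, (size, parents, label) in enumerate(components):
        if label is not None:
            # Lines 3-4: the labeled variable goes to the unique node with label L.
            anchor, label_name = label
            position = node_of_label[label_name]
        else:
            # Line 5: start directly below the lowest node used by a parent.
            anchor = 1
            lowest = [theta[(i, components[i][0])] for i in parents]
            position = 1 + max(lowest, default=0)  # assumption: max of empty set is 0
        # Lines 6-7: lay out the component as consecutive nodes around the anchor.
        for k in range(1, size + 1):
            theta[(j, k)] = position + (k - anchor)
    return theta
```

With a two-variable component whose second variable carries label A (at node 10) and an unlabeled three-variable child component, the first component occupies nodes 9 and 10 and the second occupies nodes 11 to 13.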

Clearly, this algorithm defines $\theta'$ for all variables of Q'. Since for any x, $\theta'(x)$
cannot be greater than max{v | path-structure node v has a label} + |Q'|, $\theta'$ maps
into the (|Q|-scattered) path structure.

Lines 6-7 assure that all the Child-atoms of Q' are true. Line 4 assures that the
label-atoms are true: otherwise, $\theta$ could not be a satisfaction of Q. For a component
$C_j$ without a label-atom, line 5 assures that all atoms of the form $R(x^{C_i}_{|C_i|}, x^{C_j}_1)$, for
R either $Child^+$ or $Child^*$, are satisfied because $\theta'(x^{C_j}_1) = \theta'(x^{C_i}_{|C_i|}) + 1$.

Finally, lines 3-4 handle the case that component $C_j$ contains a label atom $L(x^{C_j}_l)$.
By Lemma 7.5 (a) the choice of label atom for the component in line 3 is deterministic.
What has to be shown is that

$$\theta'(x^{C_j}_1) \ge 1 + \max\{\theta'(x^{C_i}_{|C_i|}) \mid C_i \text{ is a parent } G_{Q'}\text{-component of } C_j\}.$$

It is easy to verify by induction that

$$\max\{\theta'(x^{C_i}_{|C_i|}) \mid C_i \text{ is a parent } G_{Q'}\text{-component of } C_j\} \le v + |Q| - |C_j|,$$

where v is the bottommost among the nodes of the path structure carrying labels
$L_0$ that appear in the ancestor components of $C_j$. Thus, if all these labels $L_0$ occur
above $\theta(x^{C_j}_l)$ we are done. (In a |Q|-scattered path structure, $|\theta(x^{C_j}_l) - v| \ge |Q|$.)

We know that by our construction, label L does not occur in any of the ancestor-components
of $C_j$. But then, if all labels that occur in ancestor-components of $C_j$
differ from L, by Lemma 7.5 (b) the path-structure node v must be above $\theta'(x^{C_j}_l)$,
otherwise $\theta$ would not be a satisfaction of Q. □

Lemma 7.7. Let Q be an ABCQ[{$Child$, $Child^*$, $Child^+$}] such that the components
of $G_Q$ are successor-repellent and each $G_Q$-component contains at most one
label-atom. Then, the query Q' obtained from Q by replacing each occurrence of
predicate $Child$ by $Child^+$ is equivalent to Q.

Proof. Since $Child \subseteq Child^+$, it is obvious that $Q \subseteq Q'$. For the other direction,
let $\theta'$ be any satisfaction of Q'. We define a valuation $\theta$ for Q from $\theta'$. For every
$G_Q$-component C, let $\theta(x^C_k) := \theta'(x^C_l) + k - l$ if there is a label-atom over variable $x^C_l$
- as shown above, there is at most one such variable per component - or $\theta(x^C_k) :=
\theta'(x^C_1) + k - 1$ if component C does not contain a label-atom. It is now easy to verify
that $\theta$ is indeed a satisfaction for Q: The label- and Child-atoms of Q are satisfied
by definition. Since $\theta(x^{C_i}_{|C_i|}) \le \theta'(x^{C_i}_{|C_i|})$ and $\theta(x^{C_j}_1) \ge \theta'(x^{C_j}_1)$, $R(\theta'(x^{C_i}_{|C_i|}), \theta'(x^{C_j}_1))$,
where R is either $Child^*$ or $Child^+$, implies $R(\theta(x^{C_i}_{|C_i|}), \theta(x^{C_j}_1))$. Thus, $Q' \subseteq Q$ and
consequently $Q' \equiv Q$. □

Proof of Lemma 7.3 

Proof of Lemma 7.3. We define a number of restrictions of the set $\Pi_Q$ of
variable-paths in Q. For labels X, let $\Pi_Q|_X$ denote the set of variable-paths in $\Pi_Q$
which contain a variable with label X. Let $\Pi_Q|_{\neg X} = \Pi_Q - \Pi_Q|_X$ and $\Pi_Q|_{X \wedge Y} =
\Pi_Q|_X \cap \Pi_Q|_Y$. For variables x, let $\Pi_Q|_x$ denote the set of all variable-paths in $\Pi_Q$
in which x occurs. Let $LC(\psi)$ denote the label-paths in $LP(\Pi_Q|_\psi)$ concatenated
in any (say, lexicographic) order.

Let $\Gamma = \{E_1, \ldots, E_m\}$. There is a variable-path $x_1.\cdots.x_k \in \Pi_{Q'}$ and query Q'
contains atoms $E_1(x_{i_1}), \ldots, E_m(x_{i_m})$ such that, w.l.o.g., $1 \le i_1 \le \cdots \le i_m \le k$.
By assumption, there is no such variable-path in $\Pi_Q$.

Construction of path-structure M. We define M as the path structure

$$LC(\neg E_1).LC(E_1 \wedge \neg E_2).LC(E_1 \wedge E_2 \wedge \neg E_3).\cdots.LC(E_1 \wedge \cdots \wedge E_{m-1} \wedge \neg E_m)$$

Since $\Pi_Q|_{E_1 \wedge \cdots \wedge E_m}$ is empty, M is a concatenation of all paths in $LP(\Pi_Q)$.

M is a model of Q. We show that Q is true on any concatenation of the label-paths
of $LP(\Pi_Q)$. Consider the partial function $\theta$ from variables of Q to nodes in
M defined as

$\theta(x) = v \Leftrightarrow$ v is the topmost node in M such that for all $\pi.x.\pi' \in \Pi_Q|_x$,
$\pi.x$ can be matched in the path from the root of M to v.

We say that a variable-path $\pi \in \Pi_Q$ can be matched in a subpath $\pi'$ of a path
structure iff each of the variables x in $\pi$ can be mapped to a node a(x) in $\pi'$ such
that if L(x) is an atom in Q, a(x) carries label L, and if x occurs before y in $\pi$,
a(x) occurs before a(y) in $\pi'$.
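
Matching is in essence a subsequence test and can be decided greedily. The sketch below assumes that a variable-path is given as the list of label sets required by its variables, that a path is given as the list of label sets of its nodes, and that "before" is read as strictly before; the function name `match_greedy` is illustrative.

```python
from typing import List, Optional, Set


def match_greedy(var_labels: List[Set[str]], path: List[Set[str]]) -> Optional[List[int]]:
    """Map each variable to the earliest admissible node, or return None.

    var_labels[i] holds the labels required of the i-th variable of the
    variable-path; path[v] holds the labels carried by node v.
    """
    positions: List[int] = []
    node = 0
    for required in var_labels:
        # Skip nodes that lack one of the required labels.
        while node < len(path) and not required <= path[node]:
            node += 1
        if node == len(path):
            return None  # the variable-path cannot be matched
        positions.append(node)
        node += 1  # the next variable must be mapped strictly further down
    return positions
```

Because every variable is placed as early as possible, the last returned position is the topmost node at which the whole variable-path can be matched, which is the quantity used in the definition of $\theta$ above.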

As M is a concatenation of all paths in $LP(\Pi_Q)$, for each x, the label-paths of
all prefixes of paths in $\Pi_Q|_x$ occur in M. Thus $\theta$ is defined for all variables in Q.

The valuation $\theta$ is also consistent. By definition, $\theta$ satisfies all unary ("label")
atoms. Consider a binary atom $Child^+(x,y)$ or $Child^*(x,y)$. (Thus there is a path
$\pi.x.y.\pi' \in \Pi_Q$.) Assume that $\theta(x) = v$ and $\theta(y) = w$. By definition, v is the
topmost node such that all variable-paths with a prefix $\pi_0.x$ can be matched in the
subpath of M from the root to v. For each such $\pi_0.x$, $\pi_0.x.y$ must match the path
from the root of M to w. Thus, w must be below v in M.

M is not a model of Q'. Assume there is a satisfaction $\theta$ of Q' on M.

(1) By definition, $\theta(x_{i_1})$ cannot be a node in $LC(\neg E_1)$.

($j \to j + 1$) Induction step: Assume that $\theta(x_{i_j})$ cannot be a node in the prefix
$LC(\neg E_1).\cdots.LC(E_1 \wedge \cdots \wedge E_{j-1} \wedge \neg E_j)$ of M. For $\theta$ to be a satisfaction, $\theta(x_{i_{j+1}})$
must either be a descendant of $\theta(x_{i_j})$ or $\theta(x_{i_{j+1}}) = \theta(x_{i_j})$. By the induction
hypothesis, $\theta(x_{i_{j+1}})$ cannot be in $LC(\neg E_1).\cdots.LC(E_1 \wedge \cdots \wedge E_{j-1} \wedge \neg E_j)$. But
by definition $\theta(x_{i_{j+1}})$ cannot be a node in $LC(E_1 \wedge \cdots \wedge E_j \wedge \neg E_{j+1})$ either. It
follows that $\theta(x_{i_{j+1}})$ cannot be a node in $LC(\neg E_1).\cdots.LC(E_1 \wedge \cdots \wedge E_j \wedge \neg E_{j+1})$.

So $\theta(x_{i_m})$ must remain undefined. Contradiction with our assumption that $\theta$ is a
satisfaction of Q' on M. □

We illustrate the construction by an example. 

Example 7.8. Consider the 2-diamond query $D_2$ shown in Figure 12 (a) and
the ABCQ Q of Figure 12 (b). In Q there is no path that contains both $E_1 =
X'_1$ and $E_2 = X_2$, while $D_2$ contains such a path. The path-structure M =
$LC(\neg X'_1).LC(X'_1 \wedge \neg X_2)$ constructed as described above is shown in Figure 12 (c).
It consists of a concatenation of the two paths $Y_1.X_1.Y_2.X_2.Y_3$ and $Y_1.X_1.Y_2.X'_2.Y_3$
- which do not contain $X'_1$ (and which we can add to M in any order) - with
the path $Y_1.X'_1.Y_2.X'_2.Y_3$, which contains $X'_1$ but not $X_2$ and which is therefore
appended to M after the other two paths. It is easy to see that indeed Q is true on
M. However, $D_2$ is false on M. (The unique occurrence of $X'_1$ in M is a descendant
of the unique occurrence of $X_2$.) This witnesses that $Q \not\subseteq D_2$. □

Fig. 12. Example of the path structure construction of the proof of Lemma 7.3.